Handling Missing Values

Once again, as part of project work at IIITB, we were tasked with finding the best startup to invest in. Without going into the details of the project, I would like to pick up a specific problem where we had to handle missing values.

Only the fundingamount feature is required for our analysis.

Number of observations: 114,949, with 19,990 missing values (NAs).


Summary statistics: summary(missingValues$fundingamount)

Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
0.000e+00 3.225e+05 1.681e+06 1.043e+07 7.000e+06 2.127e+10     19990

To look at missing data across all features and observations, the “mice” package (its md.pattern() function) may be used.

Multi-feature missing observations:



This is a cross tab between features and the number of observations missing. From the above result:

  • fundingamount is present in 94,959 observations (value 1) and missing in 19,990 (value 0).
  • fundingtype is present in all observations.

The margins (the last row and the last column) summarize the missingness:

  • The last column gives, for each pattern, how many features are missing in it: 0 for the 94,959 complete rows, 1 for the 19,990 rows missing only fundingamount.
  • The last row gives the total number of missing values per feature: 0 for fundingtype (no missing values at all) and 19,990 for fundingamount.
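The cross tab above comes from mice::md.pattern(). A minimal sketch on a hypothetical two-column frame whose names merely mirror the project's features:

```r
library(mice)

# Toy frame (made-up values): fundingamount has NAs, fundingtype is complete.
toy <- data.frame(
  fundingtype   = c("seed", "angel", "seed", "venture"),
  fundingamount = c(1e5, NA, 5e6, NA)
)

# Rows = missingness patterns (1 = observed, 0 = missing);
# last column = number of missing features in each pattern;
# last row    = number of missing values per feature.
md.pattern(toy)
```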

Visually interpreting missing data (aggr() is from the “VIM” package):

aggr_plot <- aggr(
  missingValues, col = c("navyblue", "red"),
  numbers = TRUE, sortVars = TRUE,
  labels = names(missingValues),
  cex.axis = .7, gap = 3,
  ylab = c("Histogram of missing data", "Pattern")
)

 Variables sorted by number of missings: 
      Variable     Count
 fundingamount 0.1739032
   fundingtype 0.0000000

Graphical Result:


As 17% of the funding amount values are missing out of a total of ~114 K rows, dropping the NA rows is not an optimal solution. Instead, the requirement is to impute values in place of the NAs. The challenge now is what values to impute.

As the feature is a ratio variable, options for imputing include:

  • Use 0
  • Use the mean or median
  • Compute the mean after outlier clipping
  • Compute the mean after finding and clipping outliers via a box plot
  • Feature transformation
  • MICE / PCA-based imputation (with only 2 features, these were not pursued)
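On a skewed sample, the candidates above give very different central values. A quick sketch on a hypothetical vector (all values made up; 2e10 mimics the extreme maximum in the summary):

```r
# Hypothetical right-skewed amounts (USD).
x <- c(0, 1e4, 5e4, 3e5, 1e6, 7e6, 2e10)

mean(x)                    # base mean, dragged up by the extreme value
median(x)                  # robust: the middle value, 3e5
mean(x, trim = 0.2)        # mean after trimming 20% from each tail
exp(mean(log(x[x > 0])))   # geometric mean via the log-transform route
```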

Computing base mean and density distribution:

Let us compute the base mean after removing NA values. This base mean is impacted by outliers on either side.

  • plot(density(filter(missingValues, !is.na(fundingamount))$fundingamount), col = "red", main = "Dist. Raised Amount")

Notice the red line that runs almost parallel to the Y axis. It is an extremely right-skewed distribution: a few funding amounts run into the tens of billions while almost all of them are near 0. The summary output above reinforces this: Min = 0, Max ≈ 21.27 billion dollars.


  • boxplot(filter(missingValues, !is.na(fundingamount))$fundingamount, main = "Dist. Raised Amount")

The box plot below also shows the data is skewed, with the box compressed near zero and the extreme values standing far above it.


  • floor(mean(filter(missingValues, !is.na(fundingamount))$fundingamount))
Base mean: 10,426,869 (10.42 M USD)

Clipping Single Outlier values from Top and Bottom:

As outliers impact the mean, one simple method we used was to clip the single most extreme value (there may be multiple observations equal to it) from both the top and the bottom, and then compute the mean again. Notice that even after removing the single extreme value on either side, the distribution remains extremely skewed; the mean dropped only from 10.42 M to 10.20 M, about 0.22 million.

  • outlier(filter(missingValues, !is.na(fundingamount))$fundingamount) -> maxOutlier    # outlier() is from the "outliers" package
  • outlier(filter(missingValues, !is.na(fundingamount))$fundingamount, opposite = TRUE) -> minOutlier
  • plot(density(filter(missingValues, !is.na(fundingamount) & fundingamount > minOutlier & fundingamount < maxOutlier)$fundingamount), main = "Dist. fundingamount with one outlier removed", col = "red")


boxplot(filter(missingValues, !is.na(fundingamount) & fundingamount > minOutlier & fundingamount < maxOutlier)$fundingamount, main = "Dist. fundingamount with one outlier removed")


  • mean(filter(missingValues, !is.na(fundingamount) & fundingamount > minOutlier & fundingamount < maxOutlier)$fundingamount)
Mean with single outlier values clipped: 10,202,965 (10.20 M USD)

Clipping Outlier values Using Box Plot:

The box plot provides a heuristic: any point more than 1.5 times the interquartile range beyond either quartile is considered an outlier. Using that, we removed outliers again, and the data moved slightly towards a normal shape. We could have tightened 1.5 to 1.2 or even less, but that would have reduced the number of rows further. Nevertheless it showed we were on the right track.
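The 1.5 × IQR rule can be checked by hand. A sketch on a made-up vector (boxplot.stats() computes its fences from the hinges rather than type-7 quantiles, but on this example both flag the same point):

```r
# Toy vector; 100 is the obvious outlier.
x <- c(1, 2, 3, 4, 5, 100)

q     <- quantile(x, c(.25, .75))
iqr   <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

x[x < lower | x > upper]    # flagged by the hand-rolled rule: 100
boxplot.stats(x)$out        # built-in fences also flag 100 here
```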

  • outliervalues <- boxplot.stats(missingValues$fundingamount)$out
  • plot(density(filter(missingValues, !(fundingamount %in% outliervalues) & !is.na(fundingamount))$fundingamount), main = "Dist. With outliers removed using Box Plot", col = "red")


  • boxplot(filter(missingValues, !(fundingamount %in% outliervalues) & !is.na(fundingamount))$fundingamount, main = "Dist. With outliers removed using Box Plot", col = "red")


  • floor(mean(filter(missingValues, !(fundingamount %in% outliervalues) & !is.na(fundingamount))$fundingamount))
Mean after outlier clipping using box plot: 3,064,248 (3.06 M USD)

Log Transformation:

We log-transformed the variable and plotted it; the transformed data was approximately normal. We computed the mean on the log scale and applied the antilog to get the mean value (the geometric mean), and this value was used to replace the NAs. Notably, replacing the NAs with the mean arrived at this way did not much alter the shape of the original distribution.

  • plot(density(log(filter(missingValues, !is.na(fundingamount) & fundingamount > 0)$fundingamount)), main = "Dist. Log(fundingamount)", col = "red")


  • boxplot(log(filter(missingValues, !is.na(fundingamount) & fundingamount > 0)$fundingamount), main = "Dist. Log(fundingamount)", col = "red")


  • floor(exp(mean(log(filter(missingValues, !is.na(fundingamount) & fundingamount > 0)$fundingamount))))
Mean after log transformation: 1,422,627 (1.42 M USD)
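The antilog of the mean of logs used above is exactly the geometric mean; a one-line sanity check on made-up numbers:

```r
x <- c(100, 1000, 10000)      # hypothetical amounts
exp(mean(log(x)))             # antilog of the mean of logs -> 1000
prod(x)^(1 / length(x))       # geometric mean, same value
```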

Means across various methods:

Method                            Mean
Base                              10.42 M USD
Single outlier values clipped     10.20 M USD
Outliers clipped with box plot     3.06 M USD
Log transformation                 1.42 M USD

Summary: As the data is heavily skewed, the mean of such a distribution is not an apt representative; outlier clipping helped to some extent but did not remove the skew the way the log transformation did. Based on this, in our project we considered the log transformation optimal and used it to compute the mean that replaced the NA/missing values.
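The replacement step itself is not shown above; a minimal sketch on a toy column (the column name merely mirrors the project's feature), assuming the log-transformation mean is the chosen fill value:

```r
# Toy stand-in for the real frame.
df <- data.frame(fundingamount = c(100, NA, 900, NA))

# Geometric mean of the observed, positive values (the log-transform mean).
fill <- exp(mean(log(df$fundingamount[!is.na(df$fundingamount) &
                                      df$fundingamount > 0])))

df$fundingamount[is.na(df$fundingamount)] <- fill
df$fundingamount    # 100, 300, 900, 300
```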

If you feel we could have tried other options as well, please let us know through the comments section.

Until next learning…


Wordcloud using R

A wordcloud is a cloud of words (a visualization technique) that gives prominence to words based on how frequently each word appears in the source text. Here is one example of how to create a wordcloud using R.

Exercise: Read comments from a news article / blog and check the trending keywords for that topic's comments.


  • Removing stop words is an iterative process, as the “tm” call tm_map(x, content_transformer(removeWords), stopwords()) does not remove all stop words in one pass.
  • content_transformer() needs to be used, mainly in later versions of “tm”, because functions such as tolower() return a plain character vector rather than the document type “tm” expects.
  • A standalone wordcloud reflects frequently used words but does not reflect sentiment about them; it should be used along with sentiment analysis of the posts.
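Before the full CSV-driven version below, a minimal self-contained sketch using two inline strings instead of the file (the package calls match the post's code):

```r
library(tm)
library(wordcloud)   # also attaches RColorBrewer

# Two made-up comments standing in for the CSV posts.
docs <- c("rural internet access and banking access",
          "internet banking for rural areas")

corp <- Corpus(VectorSource(docs))
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, content_transformer(removePunctuation))
corp <- tm_map(corp, removeWords, stopwords("english"))
corp <- tm_map(corp, content_transformer(stripWhitespace))

# min.freq = 1 because the toy corpus is tiny.
wordcloud(corp, min.freq = 1, random.order = FALSE)
```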

Sample comments below:

  1. We should first take care of all connectivity issues in remote areas. Then we should make the rural masses aware of what their rights are and how can they use it. We should make a simple user friendly app wherein the rural masses will get all-in-one assistance (agriculture,health, animal care,bus time table, online mandi trading and online banking facility.
  2. In my opinion, a citizen in rural india who opens a bank account under the PRADHAN MANTRI JAN DHAN YOJNA should be given a free sim card which has internet facility activated by default and lifetime validity.This would encuorage more and more people to open accounts and will lead them with access to mobile facillty,Interet Facility.

R Code:

  • install.packages("tm")
  • install.packages("wordcloud")    # needed for wordcloud() below; pulls in RColorBrewer
  • library("tm")
  • library("wordcloud")
  • tp <- read.csv("D:/Customer/Srini/TP.csv", sep = ",", header = TRUE, stringsAsFactors = FALSE)
  • tpPost <- tp$Post
  • tpc <- Corpus(VectorSource(tpPost))
  • tpc <- tm_map(tpc, content_transformer(removePunctuation))
  • tpc <- tm_map(tpc, content_transformer(removeNumbers))
  • tpc <- tm_map(tpc, content_transformer(tolower))
  • tpc <- tm_map(tpc, content_transformer(removeWords), stopwords("english"))
  • tpc <- tm_map(tpc, content_transformer(stripWhitespace))
  • inspect(tpc[1:10])
  • stopWords <- c("can","make","get","need","use","sir","like","also","open","call","netfor","for","new","ca","issued","become","cscs","will","lack","first","care","take")
  • tpc <- tm_map(tpc, content_transformer(removeWords), stopWords)
  • tpc <- tm_map(tpc, content_transformer(stripWhitespace))
  • inspect(tpc[1:10])
  • wordcloud(tpc, min.freq = 200, max.words = Inf, random.order = FALSE)


  • pal2 <- brewer.pal(8, "Dark2")
  • wordcloud(tpc, min.freq = 200, max.words = Inf, random.order = FALSE, rot.per = .15, colors = pal2)


  • wordcloud(tpc, min.freq = 200, max.words = Inf, random.order = FALSE, scale = c(8, .2), rot.per = .15, colors = pal2)


Until next time…