Handling Missing Values

Once again, part of project work at IIITB, we were tasked to find out best startup to invest. Not going into details of project, instead I would like to pickup a specific problem where we were supposed to handle missing values.

Only fundingamount feature is required for our analysis.

No of Observations: 114949 with 19990 missing values or NAs.

image

Summary statistics: summary(missingValues$fundingamount)

Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
0.000e+00 3.225e+05 1.681e+06 1.043e+07 7.000e+06 2.127e+10     19990

To look at missing data across all features and observations, “mice” package may be used.

Multi feature missing observations:

md.pattern(missingValues)

image

It is cross tab between features and number of observations missing. From above result

  • fundingamount is present in 94959 (value 1) and missing in 19990 (value 0)
  • fundingtype is present in all observations.

Edge columns (row number 3 and column V3) are best understand using binary math.

  • fundingtype it has 1 for both 94959 and 19990. Adding 1 to 1 in binary gives 10 take 0.
  • Thus missing values for fundingtype is 0 implies there are no missing values.
  • Similar is computation is done across rows
    • 94959 observations have values both for fundingtype and fundingamount (thus missing (0))

Visually interpreting missing data:

aggr_plot <- aggr(
  missingValues, col=c(“navyblue”,’red’),
  numbers = T, sortVars = T,
  labels = names(missingValues),
  cex.axis = 7, gap = 3,
  ylab = c(“History of missing data”, “Pattern”)
  )

Output:

 Variables sorted by number of missings: 
      Variable     Count
 fundingamount 0.1739032
   fundingtype 0.0000000

Graphical Result:

image

As there 17% missing values in funding amount of total 114 K rows, ignoring NA rows is not an optimal solution. Instead requirement is to impute values in place of NA. Now challenge is what values it impute.

As feature is a ratio variable options for imputing include:

  • Use 0
  • Use Mean  , Median
  • Compute mean, after outlier clippings
  • Compute mean, after Boxplot and find outliers and clip
  • Feature transformation
  • Using MICE / PCA (Considering only 2 features these will not be done)

Computing base mean and density distribution:

Let us compute base mean after removing NA values. This base mean is impacted by outliers on either sides.

  • plot(density(filter(missingValues,!is.na(fundingamount))$fundingamount),col = “red”,main=”Dist. Rasied Amount”)

Notice red line that is parallel to Y axis. It is a extreme right skewed distribution, where there are few fundingamounts equal to 2 billion but almost most of them are near to 0. Also above summary output reinforces that Min value = 0, Max Value = 2 Billion dollars.

image[22]

  • boxplot(filter(missingValues,!is.na(fundingamount))$fundingamount,main=”Dist. Rasied Amount”)

Below box plot also show data is skewed and there seems to only on

image

  • floor(mean(filter(missingValues,!is.na(fundingamount))$fundingamount))
Base Mean: 10,426,869 (10.42 M Dollars)

Clipping Single Outlier values from Top and Bottom:

As outlier impact mean, one simple method we used was to clip single outlier values (observations could be many if equal to value) both from top and bottom of mean. And then we computed mean again. Notice even after we removed single outlier values on either side our distribution is extremely skewed. Only mean reduced by 20 million.

  • outlier(filter(missingValues,!is.na(fundingamount))$fundingamount) –> maxOutlier
  • outlier(filter(missingValues,!is.na(fundingamount))$fundingamount,opposite = TRUE) –> minOutlier
  • plot(density(filter(missingValues, (!(is.na(fundingamount) & (fundingamount) > minOutlier) & (fundingamount)<maxOutlier))$fundingamount),main=”Dist. fundingamount with one outlier removed”,col=”red”)

image

boxplot(filter(missingValues, (!(is.na(fundingamount) & (fundingamount) > minOutlier) & (fundingamount)<maxOutlier))$fundingamount,main=”Dist. fundingamount with one outlier removed”)

image

  • mean(filter(missingValues, (!(is.na(fundingamount) & (fundingamount) > minOutlier) & (fundingamount)<maxOutlier))$fundingamount)
Mean with Single Outliers values clipped 10202965 10.20 M USD

Clipping Outlier values Using Box Plot:

BoxPlot provides a heuristics where any point away 1.5 times whiskers on either side is considered outlier. Using that we removed outliers again. Now we saw data move slightly towards normal form. We could have adjusted 1.5 to 1.2 or even less but with that our number of rows would come down. But nevertheless it showed we were on right track.

    • outliervalues <- boxplot.stats(missingValues$fundingamount)$out
  • plot( density ( filter ( missingValues, !(fundingamount %in% outliervalues) & ! is.na(fundingamount))$fundingamount), main = “Dist. With outliers removed using Box Plot”,col=”red”)

image

  • boxplot((filter(missingValues,!(fundingamount %in% outliervalues) & !is.na(fundingamount))$fundingamount), main = “Dist. With outliers removed using Box Plot”,col=”red”)

image

  • floor(mean(filter(missingValues,!(fundingamount %in% outliervalues) & !is.na(fundingamount))$fundingamount))
Mean after outlier clipping using Boxplot: 3064248 (3.06 M Dollars)

Log Transformation:

We did log transformation of variable and plotted. Then transformed data for approximately normal. Used this to find mean and applied antilog to get mean value. This value was used to replace NA. One thing post replacement of NA with mean arrived in this method did not much alter shape of original distribution.

  • plot(density(log(filter(missingValues,!is.na(fundingamount) & fundingamount > 0 )$fundingamount)),main=”Dist. Log(fundingamount)”,col=”red”)

image

  • boxplot((log(filter(missingValues,!is.na(fundingamount) & fundingamount > 0 )$fundingamount)),main=”Dist. Log(fundingamount)”,col=”red”)

image

  • floor(exp(mean(log(filter(missingValues,!is.na(fundingamount) & fundingamount > 0 )$fundingamount))))
Mean After Log Transformation: 1422627 (1.42 Million Dollars)

Means across various methods:

Method Mean
Base 10.42 M USD
Single Outlier Values Clipped 10.20 M USD
Outliers clipped with Box Plot 3.06 M USD
Log Transformation 1.42 M USD

Summary: As data is completely skewed, mean for such distribution is not apt representative and outlier clipped though helped to some extent did not remove data skew as Log transformation achieved. Based on this in our project we have considered Log transformation as optimal and used that to compute mean that was used to replace NA/ missing values.

If you feel we could have tried other options as well please let us know through comments section.

Until next learning…

-Guru

Configuring Jupyter for public, remote access

 

As a part of data analytics course project for IIITB, a team as formed and a group project was our target. The team we had was spread across (3 in Bangalore and 1 US). So, I thought best option for doing a collaborative project was using Notebooks. Though worked in a limited way using Jupyter Notebooks, never tried with a public facing end point where any user with right credentials can login and work.

Below are steps to enable Jupyter Notebooks for shared , collaborative analytics experience.

Note: Did not use Jupyter Hub for multi User Server (though ideal would want to setup simple and fast for our projects

Steps:

  • Create Virtual Machine using Azure Portal  (Or any public cloud or a machine with public end point)
    • 4 Core, 16 GB RAM, Linux OR
    • 4 Core, 16 GB RAM, Windows 2012

Though basic R and R Studio are not required for this, we have installed nevertheless for any local server testing and validations.

  • Installed R Studio.
  • Conda is package and environment management system.
    • Using Conda multiple version of R / Python can be run simultaneously without impacting each other environment.
    • Conda can installed using either Anaconda / Miniconda.
  • Installed conda with Miniconda.
  • As default conda package management does not include R environment, we created a new R environment using R essentials
  • After installation is complete, open command prompt and run below command to create R Environment
    • conda install -c r r-essentials=1.5.2
    • This command takes time to install R Packages.
  • Test it by running command below to see if Jupyter Notebooks are properly installed
    • jupyter notebook
  • It should open a browser and open http://localhost:8888/tree where we will be able to create and R notebook.

image

Next steps enable netbook server side for public access.

  • Open command prompt and type below command to create new configuration
    • jupyter notebook –generate-config
    • It gives path where configuration file is stored.
  • Run command to create password
    • jupyter notebook password
    • As is with previous command, returns path where password hash is stored and it will be in same directory as configuration file.
  • In configuration file do below,
    • Search c.NotebookApp.password and replace hash that is present in password hash file created.
    • Search c.NotebookApp.ip, remove # (uncomment) from front and put ‘*’ if users can connect from anywhere. As we are opening server for public access with only password, recommendation is to put a very strong password.
    • Search c.NotebookApp.allow_origin and remove # and set it to ‘*’.
    • Search c.NotebookApp.port and set it to port 8888
  • On Local Server, enable firewall exceptions to allow notebook web layers and kernel servers to communicate. We did open all ports for 127.0.0.1 for both incoming and outgoing. Also if public cloud, Network Security Groups setting have to be configured to allow data for port 8888.
  • Shutdown and restart jupyter netbooks and we are set to collaborate data science project.

While configuring we realized that there is lot more Conda (Anaconda / Miniconda) and would have to dwell much much deeper into how internally these things work and best practices one has to adhere to for a large scale deployment. But for now quite happy to get this started in a matter 30 minutes.

References: