Handling Missing Values

Once again as part of project work at IIITB, we were tasked to find the best startups to invest in. Without going into the details of the project, I would like to pick up a specific problem: handling missing values.

Only the fundingamount feature is required for our analysis.

Number of observations: 114,949, of which 19,990 are missing values (NAs).

image

Summary statistics: summary(missingValues$fundingamount)

Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
0.000e+00 3.225e+05 1.681e+06 1.043e+07 7.000e+06 2.127e+10     19990

To look at missing data across all features and observations, the “mice” package may be used.

Missing observations across features:

library(mice)
md.pattern(missingValues)

image

It is a cross tab between features and the number of observations in each missingness pattern. From the above result:

  • fundingamount is present in 94,959 observations (value 1) and missing in 19,990 (value 0).
  • fundingtype is present in all observations.

The edge row and column summarize the patterns:

  • The last column gives the number of variables missing in each pattern: 0 for the 94,959 complete observations, 1 for the 19,990 observations missing only fundingamount.
  • The last row gives the number of missing values per variable: 0 for fundingtype and 19,990 for fundingamount.
  • The 94,959 observations therefore have values for both fundingtype and fundingamount (nothing missing), as shown with a toy example below.
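As a toy illustration (made-up numbers, not the project data), the same layout can be seen by calling md.pattern() on a small data frame:

  library(mice)
  toy <- data.frame(
    fundingamount = c(100, NA, 250, NA, 75),
    fundingtype   = c("seed", "angel", "seed", "venture", "seed")
  )
  md.pattern(toy)

Here the first column counts the observations per pattern (3 complete rows, 2 rows missing only fundingamount), and the bottom row totals the missing values per variable (0 for fundingtype, 2 for fundingamount).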

Visually interpreting missing data:

library(VIM)
aggr_plot <- aggr(
  missingValues, col = c("navyblue", "red"),
  numbers = T, sortVars = T,
  labels = names(missingValues),
  cex.axis = .7, gap = 3,
  ylab = c("History of missing data", "Pattern")
  )

Output:

 Variables sorted by number of missings: 
      Variable     Count
 fundingamount 0.1739032
   fundingtype 0.0000000

Graphical Result:

image

As about 17% of the funding amount values are missing out of the total 114,949 rows, ignoring the NA rows is not an optimal solution. Instead, the requirement is to impute values in place of the NAs. The challenge is what values to impute.

As the feature is a ratio variable, options for imputing include:

  • Use 0
  • Use the mean or median (a minimal sketch of these simple replacements follows this list)
  • Compute the mean after clipping outliers
  • Compute the mean after finding and clipping outliers using a box plot
  • Feature transformation
  • MICE / PCA based imputation (considering there are only 2 features, these were not attempted)
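For reference, the first two options are one-liners; a minimal sketch (with dplyr, purely illustrative and not what we finally used):

  library(dplyr)
  # Replace NAs with 0, the mean, or the median of the non-missing values
  imputeZero   <- mutate(missingValues, fundingamount = ifelse(is.na(fundingamount), 0, fundingamount))
  imputeMean   <- mutate(missingValues, fundingamount = ifelse(is.na(fundingamount), mean(fundingamount, na.rm = TRUE), fundingamount))
  imputeMedian <- mutate(missingValues, fundingamount = ifelse(is.na(fundingamount), median(fundingamount, na.rm = TRUE), fundingamount))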

Computing base mean and density distribution:

Let us compute the base mean after removing NA values. This base mean is impacted by outliers on either side.

  • library(dplyr)
  • plot(density(filter(missingValues, !is.na(fundingamount))$fundingamount), col = "red", main = "Dist. Raised Amount")

Notice the red line that is almost parallel to the Y axis. It is an extremely right-skewed distribution: a few funding amounts run into the billions of dollars while most of them are near 0. The summary output above reinforces this, with Min = 0 and Max ≈ 21 billion dollars (2.127e+10).

image

  • boxplot(filter(missingValues, !is.na(fundingamount))$fundingamount, main = "Dist. Raised Amount")

The box plot below also shows the data is skewed, with a handful of extreme points dominating the scale.

image

  • floor(mean(filter(missingValues,!is.na(fundingamount))$fundingamount))
Base Mean: 10,426,869 (10.42 M Dollars)

Clipping Single Outlier Values from Top and Bottom:

As outliers impact the mean, one simple method we used was to clip the single most extreme value on each side, top and bottom (multiple observations are removed if they equal that value). We then computed the mean again. Notice that even after removing a single outlier value on either side, the distribution is still extremely skewed; the mean reduced by only about 0.22 million.

  • library(outliers)
  • outlier(filter(missingValues, !is.na(fundingamount))$fundingamount) -> maxOutlier
  • outlier(filter(missingValues, !is.na(fundingamount))$fundingamount, opposite = TRUE) -> minOutlier
  • plot(density(filter(missingValues, !is.na(fundingamount) & fundingamount > minOutlier & fundingamount < maxOutlier)$fundingamount), main = "Dist. fundingamount with one outlier removed", col = "red")

image

boxplot(filter(missingValues, !is.na(fundingamount) & fundingamount > minOutlier & fundingamount < maxOutlier)$fundingamount, main = "Dist. fundingamount with one outlier removed")

image

  • mean(filter(missingValues, !is.na(fundingamount) & fundingamount > minOutlier & fundingamount < maxOutlier)$fundingamount)
Mean with single outlier values clipped: 10,202,965 (10.20 M USD)

Clipping Outlier Values Using Box Plot:

The box plot provides a heuristic where any point more than 1.5 times the interquartile range beyond the quartiles (i.e. beyond the whiskers) on either side is considered an outlier. Using that, we removed outliers again. Now the data moved slightly towards a normal form. We could have tightened 1.5 to 1.2 or even less, but with that our number of rows would come down further. Nevertheless it showed we were on the right track.

  • outliervalues <- boxplot.stats(missingValues$fundingamount)$out
  • plot(density(filter(missingValues, !(fundingamount %in% outliervalues) & !is.na(fundingamount))$fundingamount), main = "Dist. with outliers removed using box plot", col = "red")

image

  • boxplot(filter(missingValues, !(fundingamount %in% outliervalues) & !is.na(fundingamount))$fundingamount, main = "Dist. with outliers removed using box plot", col = "red")

image

  • floor(mean(filter(missingValues,!(fundingamount %in% outliervalues) & !is.na(fundingamount))$fundingamount))
Mean after outlier clipping using Boxplot: 3064248 (3.06 M Dollars)

Log Transformation:

We log-transformed the variable and plotted it; the transformed data is approximately normal. We took the mean on the log scale and applied the antilog (i.e. the geometric mean) to get the value used to replace the NAs. Notably, replacing the NAs with the mean arrived at by this method did not much alter the shape of the original distribution.

  • plot(density(log(filter(missingValues, !is.na(fundingamount) & fundingamount > 0)$fundingamount)), main = "Dist. Log(fundingamount)", col = "red")

image

  • boxplot(log(filter(missingValues, !is.na(fundingamount) & fundingamount > 0)$fundingamount), main = "Dist. Log(fundingamount)", col = "red")

image

  • floor(exp(mean(log(filter(missingValues,!is.na(fundingamount) & fundingamount > 0 )$fundingamount))))
Mean After Log Transformation: 1422627 (1.42 Million Dollars)

Means across various methods:

  Method                            Mean
  Base                              10.42 M USD
  Single outlier values clipped     10.20 M USD
  Outliers clipped with box plot     3.06 M USD
  Log transformation                 1.42 M USD

Summary: As the data is heavily skewed, the mean of such a distribution is not an apt representative. Outlier clipping helped to some extent but did not remove the skew the way the log transformation did. Based on this, in our project we considered the log transformation optimal and used it to compute the mean that replaced the NA / missing values.
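For completeness, the actual replacement step can be sketched as below (a sketch using the same objects as above; exp(mean(log(x))) is the geometric mean of the non-missing, non-zero funding amounts):

  # Geometric mean of the non-missing, non-zero funding amounts (dplyr loaded as before)
  geoMean <- exp(mean(log(filter(missingValues, !is.na(fundingamount) & fundingamount > 0)$fundingamount)))

  # Replace the NAs with this value
  missingValues$fundingamount[is.na(missingValues$fundingamount)] <- floor(geoMean)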

If you feel we could have tried other options as well, please let us know through the comments section.

Until next learning…

-Guru

Configuring Jupyter for public, remote access

 

As a part of the data analytics course project for IIITB, a team was formed for a group project. The team was spread across locations (3 in Bangalore and 1 in the US), so I thought the best option for doing a collaborative project was using notebooks. Though I had worked with Jupyter Notebooks in a limited way, I had never tried exposing a public facing endpoint where any user with the right credentials can log in and work.

Below are the steps to enable Jupyter Notebooks for a shared, collaborative analytics experience.

Note: We did not use JupyterHub for a multi-user server (though ideal, we wanted a setup that was simple and fast for our project).

Steps:

  • Create Virtual Machine using Azure Portal  (Or any public cloud or a machine with public end point)
    • 4 Core, 16 GB RAM, Linux OR
    • 4 Core, 16 GB RAM, Windows 2012

Though base R and RStudio are not required for this, we installed them nevertheless for local server testing and validation.

  • Installed RStudio.
  • Conda is a package and environment management system.
    • Using Conda, multiple versions of R / Python can run simultaneously without impacting each other's environment.
    • Conda can be installed using either Anaconda or Miniconda.
  • Installed Conda with Miniconda.
  • As the default Conda package management does not include an R environment, we created a new R environment using R Essentials.
  • After installation is complete, open a command prompt and run the command below to create the R environment:
    • conda install -c r r-essentials=1.5.2
    • This command takes time to install the R packages.
  • Test by running the command below to see if Jupyter Notebook is properly installed:
    • jupyter notebook
  • It should open a browser at http://localhost:8888/tree where we will be able to create an R notebook.

image

The next steps enable the notebook server for public access.

  • Open a command prompt and type the command below to create a new configuration:
    • jupyter notebook --generate-config
    • It prints the path where the configuration file is stored.
  • Run the command below to create a password:
    • jupyter notebook password
    • As with the previous command, it returns the path where the password hash is stored; it will be in the same directory as the configuration file.
  • In the configuration file do the below (the relevant lines are sketched after these steps):
    • Search for c.NotebookApp.password and replace it with the hash that is present in the password hash file created above.
    • Search for c.NotebookApp.ip, remove the # (uncomment) from the front and put ‘*’ so users can connect from anywhere. As we are opening the server for public access protected only by a password, the recommendation is to use a very strong password.
    • Search for c.NotebookApp.allow_origin, remove the # and set it to ‘*’.
    • Search for c.NotebookApp.port and set it to port 8888.
  • On the local server, enable firewall exceptions to allow the notebook web layer and kernel servers to communicate. We opened all ports for 127.0.0.1 for both incoming and outgoing traffic. Also, on a public cloud, Network Security Group settings have to be configured to allow traffic on port 8888.
  • Shut down and restart the Jupyter notebook server and we are set to collaborate on the data science project.
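For reference, after these edits the relevant lines of jupyter_notebook_config.py look roughly like the excerpt below; the hash shown is only a placeholder and should be replaced with the one generated by the jupyter notebook password command:

  c.NotebookApp.password = 'sha1:<paste the hash from the generated password file here>'
  c.NotebookApp.ip = '*'
  c.NotebookApp.allow_origin = '*'
  c.NotebookApp.port = 8888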

While configuring, we realized that there is a lot more to Conda (Anaconda / Miniconda) and we would have to delve much deeper into how these things work internally and the best practices one has to adhere to for a large scale deployment. But for now we are quite happy to get this started in a matter of 30 minutes.


Architecting BI solution..

As part of a discussion forum, there was the below question from a peer. A very valid question considering the options one has to weigh before building an “end to end BI solution”.

How does the DW process work, how does data feed into the DW, how are batch jobs set up to feed the DW, who exactly creates the star schema; in one line, the end-to-end process until data is stored in the DW, and then how the ETL tool pulls data from the DW. And the request was not to use Google links as answers.

Now, this question cannot be answered well in that forum, so I converted it into a blog post (which also helps me with more traffic). There is a series of blog posts that cover Source Systems -> Building BI Systems -> Predicting Employee Attrition, and all along the assets will be uploaded to GitHub if anyone wants to reuse them. Also, adhering to general practice, let us understand a BI solution from the architecture, design and development standpoints to appreciate the know-how at each layer. This post details “Architecting a BI Solution”.

Depicted below is the “Conceptual BI Architecture”, the most generic representation of a BI solution on any platform and technology.

BI Conceptual Architecture

At an architectural level, below are the questions that one needs to ask.

Sourcing Data:

Regarding source data, a few of the questions:

Access Methods:

There is a high degree of variation in how data can be extracted from source systems. Categorize the applications and identify the different methodologies to pull data.

  • For home grown applications, extraction at the data layer is generally possible.
  • For ERP software like SAP, there is NO way to extract data from the DB layer. Instead, such applications provide APIs and other access methods to pull data, and one needs to use only those.
  • Cloud based SaaS solutions only have API methods to pull data.

Generally, extraction at the application / middle tier layer is the ONLY option for third party / external applications.

Data Formats: Different applications provide different data formats. If accessing the DB directly it is straightforward, but generally one ends up using the different access methods mentioned above. When extracting data from the API layer of source systems, variations in data format will arise.

  • Structured data from any RDBMS.
  • API layer extraction:
    • Different file formats (CSV / TSV).
    • JSON files.

Data Volumes: The next aspect an architect needs to be aware of is the volume of delta / differential data.

  • Does the source system provide methods to pull delta data?
  • If so, what is the data volume (in GBs / TBs) of the daily differential extraction?

Other points to consider would be network throughput, the impact of extraction on the performance of source systems, security aspects (authorization / auditing), and the allowable staleness of data in the DWH (e.g. D - 1).

Once the architect has a clear picture of the various sources and related inputs, the next question to answer is whether data should be hosted in an ODS (Operational Data Store) or can be directly loaded into dimensional models (Star / Snowflake schema).

Requirements for Operational Data Stores: An operational data store retains the same schema as the source systems, so there is no schema level impedance between the source system (relational model) and the ODS schema. The only difference is that the ODS holds data from all source systems for a limited period of time before it is loaded into the models of the data warehouse.

  • Source system: normalized data model for that specific application.
  • ODS: normalized data models, hosted for all source systems.
  • DWH: de-normalized data models (as designed by data modelers).

To decide whether an ODS is required in a BI solution, these questions need to be answered:

  • Does the current OLTP system support reporting, or are the systems so stressed (in terms of resources) that running reports will impact business transactions?
  • For running reports, does data need to be integrated with other transaction systems? For example, to understand a customer journey end to end on a site, not just the order system but also the web log transaction systems need to be integrated. A similar case is taking inventory stock across various stores.

But be careful about the ODS; not many people like it, maybe because “Data Mart”, “Data Warehouse”, “ODS” and “Report Data Store” are used interchangeably. But if an ODS is built, it becomes the source for the data warehouse.

 

ETL:

Data extraction, transformation and load (ETL) moves data from the source systems to either the ODS (if built) or to the data models in the data warehouse. A generic data flow architecture for a BI solution is depicted below; the items highlighted in yellow indicate the ETL layer.

Data Flow Architecture

General capability / architectural questions for ETL are listed below (but not limited to these):

  • Support for diverse data sources, from data repositories (SQL / NoSQL) to ERP (SAP / PeopleSoft..) and web resources (web services, RSS feeds).
  • Availability of high performance providers for both source and destination systems (this is going to be key for the performance of the ETL layer).
  • ETL scalability is key. Generally ETL systems, due to their in-memory continuous pipeline, are good for row based transformations but not for set based operations. For example, if a column of a row needs to be transformed (row based operation), ETL is the best tool; but if data from multiple tables needs to be joined and aggregated (set based operations), database technologies are better.
  • Also, in a complex ETL workflow there may be requirements for integration with messaging middleware like MQ Series, MSMQ or BizTalk. Such requirements, if any, need to be gathered. For example, SAP data may need to be extracted using middleware like BizTalk before the ETL process initiates.
  • Finally, requirements for auditing, error handling & logging, and adherence to compliance are going to be key.

 

DWH:

Data warehouse work at the architecture level is mostly about designing dimensional models for the required subject areas while adhering to modeling principles. Other non functional requirements, like the size / volume of data, also come into play. Depicted below is a generic HR model that helps capture employee (active / exited) information that could later be used for predicting employee attrition.

When moving to physical / deployment architectures, DBA skills go a long way in implementing large yet scalable and highly performant databases that host the dimensional models. Also, as a general practice recommended by Ralph Kimball (the father of dimensional modeling), relationships between dimensions and facts are captured in a “bus matrix”.

image

There are two other layers in the architecture above, the OLAP and Reporting layers, which will be covered next. Subsequent to that I will try to blog about the design and development aspects of a BI solution.

what are bw trees, simplified summary from microsoft research paper

Azure DocumentDB utilizes Bw-trees for indexing documents, and does so in a schema-agnostic manner. While Azure DocumentDB has its own white paper on Schema-Agnostic Indexing in Azure DocumentDB, as a first step I wanted to understand the Bw-tree and how it impacts the performance of DocumentDB.

REFERENCE: The Bw-Tree: A B-tree for New Hardware Platforms

  • SUMMARY: The Bw-tree can be summarized as a B+ tree with a mapping table that virtualizes both the location and the size of physical pages. This mapping table looks similar to the page table structure used by operating systems to map virtual memory to physical memory (RAM). This decoupling enables latch free data access and log structured storage, making the Bw-tree highly performant.
  • HARDWARE CHANGES:
    • In multi core server systems, concurrency of thread execution is key for higher throughput. But with higher concurrency, system level synchronization mechanisms like latches (which protect the physical consistency of data in memory) tend to become a blocking factor, impacting scalability and throughput. When a thread waits on an object, the processor preempts that thread and schedules it for later, increasing context switches.
    • Additionally, in multi core systems, caches are shared across multiple logical processors. Updating data in place causes cache invalidation, which requires CPU cycles to fetch fresh data into the cache. The result is more context switching, and processors spend more time on context switches than performing actual work.

Bw-trees enable latch free access to the data structure and perform delta updates (only the data that changed), minimizing cache invalidations and context switches and thus increasing scalability and throughput.

Bw – Tree Architecture:

image

The Bw-tree layers include:

  • The Bw-tree layer, which sits on top of the cache layer and provides access methods (search / update) for the underlying B+ tree structure.
  • The cache layer: when operations occur at the Bw-tree layer, instead of performing actions directly on physical pages, the mapping table in the cache layer is used. The mapping table is a map between logical pages (PIDs) and physical pages (either on SSD or in memory).
  • The flash (storage) layer, which implements log structured storage, enabling “delta updates”.

Update, 10th Feb 2017: Let us now dive deeper into each of these individual components. We are at Section II, Bw-tree Architecture, of the Microsoft Research paper mentioned above.

The mapping table, as we understand, is key to the Bw index, which sits atop a physical B-tree structure.

The caching layer sits between the Bw-tree layer and the storage layer, and it maintains the mapping table. The mapping table maps logical pages to physical pages. Logical pages are identified by unique page identifiers (PIDs) and each PID maps to a physical page. Physical pages may be stored either in memory or persisted on durable storage like SSDs / HDDs. If a physical page is on SSD, the mapping table holds the offset of the physical page; if it is in memory, a memory address pointer is stored. Thus the objective of the mapping table is to decouple (or loosely couple) logical pages in the Bw-tree layer from physical pages, irrespective of where they are stored. With this decoupled approach, the upper layers of the Bw-tree do not care where a physical page is, as they access only the logical pages (PIDs) tracked in the mapping table.

In addition to decoupling the links between logical and physical pages, the mapping table also carries the inter node links, whether across levels (child / parent) or at the same level (siblings). If nodes are linked at the logical layer, there is no requirement to maintain inter node links at the physical level. For example, to traverse the tree to seek a particular record, the node traversal can be done at the logical layer; on reaching the leaf node, its entry in the mapping table gives the physical location from which to read the data. Additionally, if the tree structure is modified (by operations other than reads), physical inter node links do not need to be rewritten; only the logical layer links need to be maintained.
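As a purely conceptual toy (in R, which is what the rest of this blog uses; the paper's implementation is latch free C++ and far more involved), the mapping table indirection and delta updates can be pictured like this: readers always go PID -> mapping table -> page, and writers prepend a delta record instead of updating the page in place:

  # Toy mapping table: PID -> (chain of delta records, base page)
  mappingTable <- list(
    P1 = list(deltas = list(), base = c(k1 = 10, k2 = 20))
  )

  # An "update" prepends a delta record; the base page is never touched in place
  addDelta <- function(tbl, pid, key, value) {
    tbl[[pid]]$deltas <- c(list(setNames(value, key)), tbl[[pid]]$deltas)
    tbl
  }

  # A read folds the delta chain (oldest to newest) over the base page
  readPage <- function(tbl, pid) {
    page <- tbl[[pid]]$base
    for (d in rev(tbl[[pid]]$deltas)) page[names(d)] <- d
    page
  }

  mappingTable <- addDelta(mappingTable, "P1", "k2", 99)
  readPage(mappingTable, "P1")   # k1 = 10, k2 = 99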

Developing with HTU21DF Sensor part 4.

In previous posts we detailed the HTU21DF sensor internals and its communication sequence.

In this post, after laying the groundwork, let us start development and write code to capture humidity and temperature from the sensor. The code is written in C# using the “Universal Windows Platform”.

I am developing one module at a time and sharing it on GitHub @ https://github.com/gurucb/HomeAutomation. For those who need the code without GitHub, write to me at gurucb@outlook.com and I will share it.

under hood of HTU21DF sensor part 3

Cyclic Redundancy Check:

To detect modification (corruption) of the raw sensor (HTU21DF) measurement data, a Cyclic Redundancy Check is used. The polynomial used for the CRC is

image

Above polynomial can be written as

image

Taking the coefficients of all terms of the polynomial x^8 + x^5 + x^4 + 1, our binary value is 1 0 0 1 1 0 0 0 1. This binary value, the polynomial generator, is used as the divisor for the CRC computations.

Compute checksum (best to refer to the Wikipedia article on CRC, or use readymade tools, but if you are interested the steps are below):

  • Step 1: Take the input bytes from the sensor (MSB + LSB as detailed in the previous post). They are 16 bits long (2 bytes). Append eight 0s to the right (the number of appended bits equals the degree of the polynomial generator). As we are stuffing 0s to the right of the data bytes, effectively a “left shift” operation occurs.
  • Step 2: Starting from the left most bit that has value 1, perform a bitwise XOR with the polynomial generator. Skip any 0s and align the generator at the next bit whose value is 1.
  • Step 3: Repeat these operations, moving the polynomial generator to the right, until all 16 data bits have been consumed.
  • Step 4: By then the 0s stuffed in Step 1 will have changed state; that 8-bit remainder is the output of the CRC computation.

As an example, the humidity measured in the previous post was 31827; converted to binary, that is our input. Using the polynomial generator in an iterative manner, compute the CRC. Sample computations are below.

image

In this manner, the receiving application can compute the CRC over the MSB and LSB and compare it with the CRC byte to check that the data received is not corrupted.
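A small sketch of that check (in R, to stay consistent with the rest of this blog; the C# version follows the same bit-by-bit logic), using the generator x^8 + x^5 + x^4 + 1 with an initial value of 0 and assuming the two data bytes of the earlier example are 0x7C and 0x53 (together 31827):

  # CRC-8 with polynomial x^8 + x^5 + x^4 + 1 (binary 100110001), initial value 0
  crc8 <- function(bytes) {
    crc <- 0L
    for (b in bytes) {
      crc <- bitwXor(crc, b)                   # bring the next data byte into the register
      for (i in 1:8) {
        if (bitwAnd(crc, 0x80L) != 0L) {
          # MSB set: shift left and XOR with the low 8 bits of the generator (0x31)
          crc <- bitwXor(bitwAnd(bitwShiftL(crc, 1L), 0xFFL), 0x31L)
        } else {
          crc <- bitwAnd(bitwShiftL(crc, 1L), 0xFFL)
        }
      }
    }
    crc
  }

  crc8(c(0x7CL, 0x53L))   # 162 (0xA2); compare with the checksum byte the sensor sent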

After measuring and performing the CRC check, a formula needs to be applied to the raw measurement to arrive at the accurate value, as provided in the manual.

Relative Humidity Conversion:

image

Where S(RH) is the humidity signal output that is measured from the sensor.
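A quick sketch of that conversion for the example reading from the previous post, using the datasheet formula RH = -6 + 125 * S_RH / 2^16 and assuming the two status bits are cleared first:

  rawHumidity <- 31827L
  sRH <- bitwAnd(rawHumidity, 0xFFFCL)   # clear the two status bits
  rh  <- -6 + 125 * sRH / 65536          # roughly 54.7 %RH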

Temperature Conversion:

image

With this, the HTU21DF sensor internals are complete. As a continuation of this series, a follow up blog will detail how to programmatically get data out of the sensor using a RASPBERRY PI 2 MODEL B and the SPARKFUN WEATHER SHIELD that includes the HTU21DF sensor.

Until next post…

–Abhyast

under hood of HTU21DF sensor part 2

Schematic view of sensor with Pins:

image

  • GND: Ground
  • VDD: Supply voltage for the sensor, between 3.6 and 5 volts DC.
  • SCK: Serial clock input, used to synchronize communication between the master (microcontroller) and the HTU21DF sensor.
  • DATA: Serial data pin, used to transfer data in and out of the sensor.

Communication Sequence with Sensors:

  • Start sequence: a single bit. To initiate a transmission, the start bit has to be issued by the microcontroller.
  • Stop sequence: a single bit again, to indicate completion of the transmission of data (along with the CRC) and, if required, to release the hold on SCK.
  • Sending a command: after the start bit indicates the initiation of a transmission, the next step is to send a command to the sensor, identified by the sensor address (similar to a MAC address).
    • The device address of the HTU21DF is 0x40 in hex (64 in decimal).
    • The device address is 7 bits wide, and the bit appended after it captures the type of operation (read = 1, write = 0).
    • Thus the controller initiates communication with the HTU21DF sensor as below for both read and write operations.

As the I2C protocol specifies: Start -> address of sensor (with read / write intent).
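As a small illustration (a sketch in R, just to show the bit layout), the 7-bit address 0x40 is shifted left by one and the read / write bit is appended, giving 0x80 for a write and 0x81 for a read:

  addr      <- 0x40L                              # 7-bit HTU21DF address
  writeByte <- bitwShiftL(addr, 1L)               # 0x80: address followed by write bit (0)
  readByte  <- bitwOr(bitwShiftL(addr, 1L), 1L)   # 0x81: address followed by read bit (1)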

Read Operation:

image

Write Operation:

image

The HTU21DF sensor responds (as its address is 0x40) with an ACK (acknowledgement). Then the controller sends a command to the HTU21DF sensor. The commands and their hex and binary values are below.

image

In the comment column above there are “Hold Master” and “No Hold Master” variants. As detailed in the previous post, the I2C protocol provides an option where the sensor may choose to control the serial clock (SCK).

Hold Master means control of SCK stays with the sensor; the controller cannot perform any other actions with other sensors (if available) and is kept waiting until the sensor measurement is complete. On the other hand, No Hold Master implies control of SCK stays with the master (controller), and the controller is free to communicate with other sensors; polling is then done with the read command to check whether the measurement is done.

HOLD MASTER is analogous to synchronous programming, while NO HOLD MASTER is analogous to asynchronous programming with polling. NO HOLD MASTER may be the better choice where there are more sensors on the board.

Post measurement, data is sent by the sensor as 2 bytes, MSB (most significant byte) followed by LSB (least significant byte), and then a checksum.

Additionally, in the LSB the last 2 bits are reserved for transmitting status information; bit 1 indicates the measurement type (0 = temperature, 1 = humidity).

Communication Flow:

Step 1: Start (initiate) a session with the sensor (the slave, in this case the HTU21DF)

Step 2: Issue a write intent to the sensor, indicated by its address

image

Step 3: Sensor sends ACK to master

Step 4: Issue “Measure Humidity with HOLD” command

image

Step 5: Sensor sends ACK

Step 6: Issue read intent to sensor with address of sensor.

image

Step 7: The sensor takes control of the SCK line and makes the master (microcontroller) wait until the measurement is complete

Step 8: Send data. Post measurement, the sensor uses 14 bits (of data) + 2 bits (status) to send the measured data on the SDA line as output from the device. The data is divided into 2 sections, the most significant byte (MSB) and the least significant byte (LSB). The MSB is sent first, followed by the LSB.

image

In the above example, the relative humidity reading is 31827 (add up the numbers in the right column, excluding the status bits). The status bit value 1 indicates the measurement is humidity (0 implies temperature).
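Working backwards from that value (a small check in R): if 31827 is taken as the full 16-bit word the sensor sent, the two bytes on the wire would be MSB = 0x7C (124) and LSB = 0x53 (83):

  msb <- 0x7CL                              # 124
  lsb <- 0x53L                              # 83; its two lowest bits are the status bits
  raw <- bitwOr(bitwShiftL(msb, 8L), lsb)   # 124 * 256 + 83 = 31827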

Step 9: The HTU21DF sensor provides a CRC-8 checksum, computed by the sensor and sent as an additional redundant byte for data verification.

In this manner, communication takes place between the master (microcontroller) and an I2C compliant sensor like the HTU21DF.

In the next section we will understand the formulae for RH (relative humidity) and temperature, and also verify data integrity from the device using the checksum provided in the checksum byte.

I would appreciate it if readers review and correct any mistakes.

Till then….

Abhyast..