Bots to Hire Humans. Trained by Humans to Hire Humans – Part 3




  • Part 1 of this series provided business justifications and a framework to validate the success of such a solution, if it were to be built.
  • Part 2 of this series discussed the data exhausts available for building the models.

In this post, we will detail one of a few possible methodologies for building such a solution.

Based on the available data,

  • multiple models are developed,
  • output scores from each model are ensembled to
  • compute a final probability of selection for the candidate, and
  • using that score and a threshold, candidates are slotted for interviews.


  • Compatibility Score:
        • Tokenize text into a bag of words.
        • Remove noise and normalize language variants.
        • Use term frequency and inverse document frequency (TF-IDF) metrics to weight the importance of words.
        • Generalize resume words, and
        • compute the distance between the candidate's resume and the job description.
  • Competitive Score:
        • Based on historical data, convert candidate resumes into features, labeled by whether the candidate cleared the interview or not.
        • Build an ML model that uses these features and
        • gives a probability score predicting whether the candidate may get selected.
  • Culture Fitment Score:
        • Cluster the existing set of employees using features baselined to candidates.
        • Cluster them into groups as designed (High Performers, Average Performers, Poor Performers, etc., just as an example).
        • Slot each new candidate into one of these buckets and derive a score.
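The Compatibility Score steps above can be sketched in a few lines. This is a minimal, pure-Python illustration of TF-IDF plus cosine distance between a resume and a job description; the tokenizer, the sample texts, and the weighting details are all simplifying assumptions, not a production matcher.

```python
# Hypothetical sketch of the Compatibility Score: TF-IDF vectors for the
# resume and job description, compared via cosine similarity.
import math
from collections import Counter

def tokenize(text):
    # Tokenize text into a bag of words (lowercased, punctuation stripped).
    return [w.strip(".,!?()") for w in text.lower().split() if w.strip(".,!?()")]

def tfidf_vectors(docs):
    # Term frequency weighted by inverse document frequency across the corpus.
    tokenized = [Counter(tokenize(d)) for d in docs]
    vocab = set().union(*tokenized)
    n = len(docs)
    idf = {t: math.log(n / sum(1 for d in tokenized if t in d)) + 1 for t in vocab}
    return [{t: tf[t] * idf[t] for t in tf} for tf in tokenized]

def cosine(a, b):
    # Cosine similarity between two sparse vectors (dicts).
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

jd = "python machine learning engineer with sql experience"
resume = "experienced python engineer, machine learning and sql projects"
vecs = tfidf_vectors([jd, resume])
score = cosine(vecs[0], vecs[1])
print(round(score, 2))
```

A higher score means the resume sits closer to the job description in the shared term space; real systems would add stemming and synonym generalization as the bullets suggest.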

Ensemble all these scores into a model that outputs the probability of a candidate converting into an employee and, based on that, slot him / her for an interview.
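The ensembling step can be sketched as follows. The weights and threshold here are purely illustrative assumptions (in practice they would be learned or tuned), but the shape of the computation matches the description above.

```python
# Hypothetical ensemble: weighted average of the three scores, then a
# threshold decides whether the candidate is slotted for an interview.
# Weights and threshold are illustrative assumptions, not tuned values.
def selection_probability(compatibility, competitive, culture_fit,
                          weights=(0.4, 0.4, 0.2)):
    scores = (compatibility, competitive, culture_fit)
    return sum(w * s for w, s in zip(weights, scores))

def slot_for_interview(candidate_scores, threshold=0.6):
    p = selection_probability(*candidate_scores)
    return p, p >= threshold

p, invite = slot_for_interview((0.8, 0.7, 0.5))
print(round(p, 2), invite)  # 0.4*0.8 + 0.4*0.7 + 0.2*0.5 = 0.70 -> invited
```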

Mind you, we are NOT replacing the interview process; our objective is only to find the candidates most likely to clear interviews.


Bots to Hire Humans. Trained by Humans to Hire Humans – Part 2

The blog subject may look like clickbait (actually no, hits are better) considering positions taken by industry stalwarts recently.



Part 1 of this series provided business justifications and a framework to validate the success of such a solution, if it were to be built.

Moving along, the first step in building the solution is identifying sources of data. With computing becoming ubiquitous, every candidate leaves a data trail, aka a data exhaust, that when tapped can be transformed into ambient intelligence about the candidates to be screened. The DIKW (Data, Information, Knowledge, Wisdom) pyramid has been the backbone of many analytics product architectures. In the most general sense, converting data to information in an easier, more intuitive and more productive way is the sole existential reason for products like Power BI, Tableau, Qlik and a plethora of others. With machine learning now becoming mainstream in solutions, the gap between data and information is slowly narrowing. Computing and modeling advancements further blur the line between data and information: machines are used to identify patterns in data, leaving humans to convert that information into useful knowledge.



For our automated candidate screening solution, sources of data can be categorized into

  • Resume
  • Social (LinkedIn, Facebook)
  • Blogs
  • many others; the more the merrier.

Multiple classifications are possible. Resume data can be categorized as the claim, and other supporting data sources can be used either for validating the claim or for improving knowledge about the candidate. Below is the process flow for a hiring process.

  • The Job Description details what is needed by an organization.
  • The Resume claims that the candidate matches the role detailed in the Job Description.
  • Screening and subsequent interviews, up to final selection, are steps to validate the candidate's claims (as detailed in the resume).

Aligning to the earlier stated goal of rejecting candidates early, social media and data other than the resume can be used to validate the claims candidates make in their resumes. Based on such validations, knowledge about the candidate grows and enables the system to take informed screening decisions. But there is a downside to this approach: what if a candidate does not belong to social networking sites, does not blog, and tries to keep their digital exhaust to a minimum? Information bias is a key problem for such solutions, as data in social media is itself biased; there are good articles on information bias worth reading.

Depicted below is one of the inputs to the HR screening solution, i.e., information about the candidate.


As aforementioned, the hiring process starts with the Role / Job Description details. The candidate applies, CLAIMING he / she has the right skill sets, followed by screening and interviewing that validate the candidate's claims and thus provide the knowledge about him / her needed to take a correct decision. So the remaining sources of information that could be leveraged for the solution include

  • Job Description
  • Interview process (Have you handled stress interviews??)
  • Interviewer (maybe having a bad day??)

The first step in the screening process is to match resume claims with the Job Description (JD) of the opening. If you are on board with the solution, the next post deals with how to automatically identify the nearness of a resume to a job description, and the challenges in doing that.

Bots to Hire Humans. Trained by Humans to Hire Humans – Part 1

The blog subject may look like clickbait (actually no, hits are better) considering positions taken by industry stalwarts recently.



Gone are the days when IT companies would hire in bulk. Roll back a few years, when automation was still not present, and IT and ITES companies would hire in hordes. But even then, the number of candidates interviewed was high relative to the number selected and ultimately joined. Here is simple math that details the problem, and thus an opportunity for a product. The hit ratio for hiring is the number of candidates selected / joined divided by the number of candidates screened. As I have learnt, it is around 10 – 15% for good companies, i.e., out of 100 candidates screened, only 10 to 15 are selected. This implies 85 – 90 candidates are rejected at various stages of the interview process. Let's add a money angle to drive the point home.

Imagine company X hires 10K employees every year with a hit ratio of 15%. For 10K they need to screen at least 150K. Conservatively, even if 100K of those candidates could have been rejected early in the hiring cycle, it would be more economical for the company. Assuming Rs. 100 per candidate, whether selected or rejected, that works out to 100K * 100, approximately 1 Crore rupees. Please keep in mind this is not verified, but the gist of discussions with a few HR teams at a symposium. So there at least seems to be justification in exploring a solution: an automated screening process that ensures candidates who would ultimately fail the selection process fail early in the cycle, which would be a win – win for both candidate and company.

This post presents the slides with commentary; later posts will detail each of these slides in depth.

1. Everyone Wants Super Employee

The slide is self-explanatory: Right Talent, Just in Time, with the best fit. One simple way to explain it is the "Purple Squirrel". Are you a purple squirrel? A purple squirrel is a candidate who perfectly matches a job's requirements in every way, from education to skills to personality.




2. Business metrics for Machine Learning models: Is machine learning actually helping the business, or is everyone just travelling the hype cycle at the peak of inflated expectations?

One of the important lessons I have learned is that for machine learning tasks, defining business metrics that capture success is very important; otherwise it just ends up as a good project to have on a resume. So in this slide we define the metrics that would determine the success of a machine learning project if it were introduced into the HR hiring cycle.


A few points from the slide above:

  • Reach Ratio is the efficiency of the hiring process.
    • Conducting first-level screening manually requires lots of eyeballs to scan resumes and select possible candidates for the next level of screening. Manual implies cost and limits (of course, no one is hiring in millions).
    • As more candidate resumes pour in, increasing efficiency (more resumes screened) impacts effectiveness.
  • Candidate Hit Ratio is the effectiveness of screening in the hiring process.
    • If every candidate who is screened and slotted for an interview ends up selected, screening is perfect.

Now, the Reach Ratio tending towards zero implies that for a job opening we reach as many candidates as possible (the more the better), and from them effectively choose the single candidate who will clear all interviews, get selected, and join. Realistically, that is a nirvana state and not practically achievable; but even if the solution predicts with 70 – 80% accuracy, that is a huge gain for the hiring process.
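A back-of-the-envelope check of the cost argument above; every figure here is an assumption carried over from the (unverified) numbers discussed in this post.

```python
# Hit ratio and potential screening savings, using the assumed figures above.
def hit_ratio(selected, screened):
    return selected / screened

screened, early_rejectable, cost_per_candidate = 150_000, 100_000, 100
savings_inr = early_rejectable * cost_per_candidate
print(f"hit ratio example: {hit_ratio(15, 100):.0%}")              # 15%
print(f"savings: {savings_inr:,} INR (~{savings_inr / 1e7:.0f} crore)")
```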

If you are on board with the solution, the next post deals with the data exhaust from candidates.

Handling Missing Values

Once again as part of project work at IIITB, we were tasked with finding the best startup to invest in. Without going into details of the project, I would like to pick up a specific problem where we were supposed to handle missing values.

Only the fundingamount feature is required for our analysis.

No. of observations: 114,949, with 19,990 missing values or NAs.


Summary statistics: summary(missingValues$fundingamount)

Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
0.000e+00 3.225e+05 1.681e+06 1.043e+07 7.000e+06 2.127e+10     19990

To look at missing data across all features and observations, “mice” package may be used.

Multi feature missing observations:



This is a cross tab between features and the number of observations missing. From the result above:

  • fundingamount is present in 94,959 observations (value 1) and missing in 19,990 (value 0).
  • fundingtype is present in all observations.

The edge cells (row 3 and column V3) are best understood as margin counts of missing values.

  • fundingtype has value 1 in both the 94,959 pattern and the 19,990 pattern, so it is never missing.
  • Thus the missing count for fundingtype in the bottom row is 0, meaning there are no missing values.
  • A similar computation is done across rows:
    • 94,959 observations have values for both fundingtype and fundingamount (thus 0 missing).

Visually interpreting missing data (using aggr from the VIM package):

aggr_plot <- aggr(
  missingValues, col = c("navyblue", "red"),
  numbers = TRUE, sortVars = TRUE,
  labels = names(missingValues),
  cex.axis = .7, gap = 3,
  ylab = c("Histogram of missing data", "Pattern")
)


 Variables sorted by number of missings: 
      Variable     Count
 fundingamount 0.1739032
   fundingtype 0.0000000

Graphical Result:


As 17% of the funding amount values are missing out of a total of ~114K rows, ignoring the NA rows is not an optimal solution. Instead, the requirement is to impute values in place of the NAs. Now the challenge is what values to impute.

As the feature is a ratio variable, options for imputing include:

  • Use 0
  • Use the mean or median
  • Compute the mean after outlier clipping
  • Compute the mean after finding and clipping outliers using a box plot
  • Feature transformation
  • Using MICE / PCA (considering there are only 2 features, these will not be done)

Computing base mean and density distribution:

Let us compute the base mean after removing NA values. This base mean is impacted by outliers on either side.

  • plot(density(filter(missingValues, !$fundingamount), col = "red", main = "Dist. Raised Amount")

Notice the red line that is parallel to the Y axis. It is an extremely right-skewed distribution: a few fundingamounts reach into the tens of billions, but almost all of them are near 0. The summary output above reinforces this: Min value = 0, Max value ≈ 21 billion dollars.


  • boxplot(filter(missingValues, !$fundingamount, main = "Dist. Raised Amount")

The box plot below also shows the data is skewed, and there seems to be only one extreme outlier.


  • floor(mean(filter(missingValues, !$fundingamount))
Base Mean: 10,426,869 (10.42 M Dollars)

Clipping Single Outlier values from Top and Bottom:

As outliers impact the mean, one simple method we used was to clip the single largest and smallest outlier values (the observations removed could be many if they equal that value) from the top and bottom. Then we computed the mean again. Notice that even after removing the single outlier values on either side, the distribution is still extremely skewed; the mean reduced by only about 0.22 million.

  • outlier(filter(missingValues, !$fundingamount) -> maxOutlier
  • outlier(filter(missingValues, !$fundingamount, opposite = TRUE) -> minOutlier
  • plot(density(filter(missingValues, ! & fundingamount > minOutlier & fundingamount < maxOutlier)$fundingamount), main = "Dist. fundingamount with one outlier removed", col = "red")


boxplot(filter(missingValues, ! & fundingamount > minOutlier & fundingamount < maxOutlier)$fundingamount, main = "Dist. fundingamount with one outlier removed")


  • mean(filter(missingValues, ! & fundingamount > minOutlier & fundingamount < maxOutlier)$fundingamount)
Mean with single outlier values clipped: 10,202,965 (10.20 M USD)

Clipping Outlier values Using Box Plot:

The box plot provides a heuristic where any point more than 1.5 times the IQR beyond the whiskers on either side is considered an outlier. Using that, we removed outliers again. Now we saw the data move slightly towards a normal form. We could have adjusted the 1.5 down to 1.2 or even less, but that would further reduce the number of rows. Nevertheless, it showed we were on the right track.

  • outliervalues <- boxplot.stats(missingValues$fundingamount)$out
  • plot(density(filter(missingValues, !(fundingamount %in% outliervalues) & !$fundingamount), main = "Dist. with outliers removed using Box Plot", col = "red")


  • boxplot(filter(missingValues, !(fundingamount %in% outliervalues) & !$fundingamount, main = "Dist. with outliers removed using Box Plot", col = "red")


  • floor(mean(filter(missingValues, !(fundingamount %in% outliervalues) & !$fundingamount))
Mean after outlier clipping using box plot: 3,064,248 (3.06 M Dollars)

Log Transformation:

We log-transformed the variable and plotted it; the transformed data was approximately normal. We used this to find the mean and applied the antilog to get the mean value, which was then used to replace the NAs. Notably, replacing the NAs with the mean arrived at by this method did not much alter the shape of the original distribution.

  • plot(density(log(filter(missingValues, ! & fundingamount > 0)$fundingamount)), main = "Dist. Log(fundingamount)", col = "red")


  • boxplot(log(filter(missingValues, ! & fundingamount > 0)$fundingamount), main = "Dist. Log(fundingamount)", col = "red")


  • floor(exp(mean(log(filter(missingValues, ! & fundingamount > 0)$fundingamount))))
Mean after log transformation: 1,422,627 (1.42 Million Dollars)

Means across various methods:

Method                            Mean
Base                              10.42 M USD
Single outlier values clipped     10.20 M USD
Outliers clipped with box plot     3.06 M USD
Log transformation                 1.42 M USD

Summary: As the data is heavily skewed, the mean of such a distribution is not an apt representative; outlier clipping, though it helped to some extent, did not remove the data skew the way the log transformation did. Based on this, in our project we considered the log transformation optimal and used it to compute the mean that replaced the NA / missing values.

If you feel we could have tried other options as well, please let us know through the comments section.

Until next learning…


Configuring Jupyter for public, remote access


As part of a data analytics course project for IIITB, a team was formed and a group project was our target. The team was spread across locations (3 in Bangalore and 1 in the US). So I thought the best option for doing a collaborative project was using notebooks. Though I had worked in a limited way with Jupyter Notebooks, I had never tried them with a public-facing endpoint where any user with the right credentials can log in and work.

Below are the steps to enable Jupyter Notebooks for a shared, collaborative analytics experience.

Note: We did not use JupyterHub for a multi-user server (though ideal, we wanted a simple and fast setup for our project).


  • Create a Virtual Machine using the Azure Portal (or any public cloud, or a machine with a public endpoint)
    • 4 cores, 16 GB RAM, Linux OR
    • 4 cores, 16 GB RAM, Windows 2012

Though base R and RStudio are not required for this, we installed them nevertheless for local server testing and validation.

  • Installed RStudio.
  • Conda is a package and environment management system.
    • Using Conda, multiple versions of R / Python can run simultaneously without impacting each other's environments.
    • Conda can be installed using either Anaconda or Miniconda.
  • Installed conda via Miniconda.
  • As the default conda package management does not include an R environment, we created a new R environment using R Essentials.
  • After installation is complete, open a command prompt and run the command below to create the R environment:
    • conda install -c r r-essentials=1.5.2
    • This command takes time to install the R packages.
  • Test it by running the command below to see if Jupyter Notebooks is properly installed:
    • jupyter notebook
  • It should open a browser at http://localhost:8888/tree, where we can create an R notebook.


The next steps enable the notebook server side for public access.

  • Open a command prompt and type the command below to create a new configuration:
    • jupyter notebook --generate-config
    • It prints the path where the configuration file is stored.
  • Run the command below to create a password:
    • jupyter notebook password
    • As with the previous command, it returns the path where the password hash is stored; it will be in the same directory as the configuration file.
  • In the configuration file, do the following:
    • Search for c.NotebookApp.password and replace the hash with the one in the password hash file just created.
    • Search for c.NotebookApp.ip, remove the # (uncomment), and put '*' so users can connect from anywhere. As we are opening the server for public access protected only by a password, the recommendation is to set a very strong password.
    • Search for c.NotebookApp.allow_origin, remove the #, and set it to '*'.
    • Search for c.NotebookApp.port and set it to port 8888.
  • On the local server, enable firewall exceptions to allow the notebook web layer and kernel servers to communicate. We opened all ports for both incoming and outgoing traffic. Also, on a public cloud, Network Security Group settings have to be configured to allow traffic on port 8888.
  • Shut down and restart Jupyter Notebooks, and we are set to collaborate on the data science project.
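Pulled together, the configuration edits look roughly like the fragment below. This is a sketch of jupyter_notebook_config.py, where `c` is the configuration object Jupyter provides; the password hash shown is a placeholder, to be replaced with the one generated by `jupyter notebook password`.

```python
# jupyter_notebook_config.py -- relevant lines after the edits above.
c.NotebookApp.password = 'sha1:<your-generated-hash>'  # placeholder hash
c.NotebookApp.ip = '*'             # listen on all interfaces; pair with a strong password
c.NotebookApp.allow_origin = '*'   # allow cross-origin access for collaborators
c.NotebookApp.port = 8888          # must match the firewall / NSG rule
```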

While configuring, we realized there is a lot more to Conda (Anaconda / Miniconda), and we would have to dig much deeper into how these things work internally and the best practices one has to adhere to for a large-scale deployment. But for now, quite happy to get this started in a matter of 30 minutes.


Architecting a BI Solution

As part of a discussion forum, the below question came from a peer. A very valid question, considering the options one has to weigh before building an "End to End BI solution".

How does the DW process work, how does data feed into the DW, how are batch jobs set up to feed the DW, and who exactly creates the star schema? In one line: the end-to-end process until data is stored in the DW. Then, how does the ETL tool pull data from the DW? And the request was not to USE GOOGLE LINKS as answers.

Now this question cannot be answered in that forum, so I converted it into a blog post (which also helps me with more traffic). There will be a series of blog posts covering Source Systems –> Building BI Systems –> Predicting Employee Attrition, and along the way assets will be uploaded to GitHub for anyone who wants to reuse them. Also, adhering to general practice, let us understand a BI solution from the architecture, design and development standpoints to appreciate the know-how at each layer. This post details "Architecting the BI Solution".

Depicted below is the "Conceptual BI Architecture", the most generic representation of a BI solution on any platform and technology.

BI Conceptual Architecture

At an architectural level, below are the questions one needs to ask.

Sourcing Data:

Regarding source data, a few of the questions:

Access Methods:

There is a high degree of variation in how data can be extracted from source systems. Categorize the applications and identify the different methodologies to pull data.

  • For home-grown applications, extraction at the data layer is generally possible.
  • For ERP software like SAP, there is NO way to extract data from the DB layer. Instead, such applications provide APIs and other access methodologies to pull data, and one needs to use only those.
  • Cloud-based SaaS solutions only have API methods to pull data.

Generally, extraction at the application / middle tier is the ONLY option for third-party / external applications.

Data Formats: Different applications provide different data formats. If accessing the DB directly, this is straightforward, but generally one ends up using the different access methods aforementioned. When extracting data through the API layer of source systems, variations in data format will arise.

  • Any RDBMS: structured data.
  • API layer extraction:
    • Different file formats (CSV / TSV)
    • JSON files

Data Volumes: The next aspect an architect needs to be aware of is the volume of delta / differential data.

  • Does the source system provide methods to pull delta data?
  • If so, what is the data volume (in GBs / TBs) of the daily differential extraction?

Other points to consider would be network throughput, the impact of extraction on the performance of source systems, security aspects (authorization / auditing), and the allowable staleness of data in the DWH (e.g., D – 1).

Once the architect has a clear picture of the various sources and related inputs, the next question to answer is whether data should be hosted in an ODS (Operational Data Store) or can be loaded directly into dimensional models (Star / Snowflake schemas).

Requirements for Operational Data Stores: Operational data stores retain the same schema as the source systems; there is no schema-level impedance between the source systems (relational models) and the ODS schema. The only difference is that the ODS holds data from all source systems for a limited period of time before it is loaded into the data warehouse models.

  • Source System: Normalized data model for that specific application.
  • ODS: Normalized data models, hosted for all source systems.
  • DWH: De-normalized data models (as designed by data modelers).

To understand whether an ODS is required in a BI solution, these questions need to be answered:

  • Does the current OLTP system support reporting, or are the systems so stressed (in terms of resources) that running reports will impact business transactions?
  • For running reports, does data need to be integrated across transaction systems? For example, to understand a customer's end-to-end journey on a site, not just the Order System but also the Web Log transaction systems need to be integrated. Similar is the case of taking inventory stock across various stores.

But be careful about the ODS; not many people like it, maybe because "Data Mart", "Data Warehouse", "ODS" and "Report Data Store" are used interchangeably. But if an ODS is built, it becomes the source for the Data Warehouse.



Data extraction, transformation and load (ETL) moves data from the source systems to either the ODS (if built) or the data models in the Data Warehouse. A generic data flow architecture for a BI solution is depicted below. Notice the components highlighted in yellow indicating the ETL layer.

Data Flow Architecture

General capability / architectural questions for ETL are listed below (but not limited to these):

  • Support for diverse data sources, from data repositories (SQL / NoSQL) to ERP (SAP / PeopleSoft, etc.) to web resources (web services, RSS feeds).
  • Availability of high-performance providers for both source and destination systems (this is going to be key for the performance of the ETL layer).
  • ETL scalability is key. Generally, ETL systems, due to their in-memory continuous pipeline, are good for row-based transformations but not for set-based operations. For example, if a column of a row needs to be transformed (a row-based operation), ETL is the best tool; but if data from multiple tables needs to be joined or aggregated (set-based operations), database technologies are better.
  • Also, in a complex ETL workflow there may be requirements for integration with messaging middleware like MQ Series, MSMQ or BizTalk. Such requirements, if any, need to be gathered. For example, SAP data may need to be extracted using middleware like BizTalk before the ETL process initiates.
  • Finally, requirements for auditing, error handling & logging, and adherence to compliance are going to be key.
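To make the row-based vs set-based distinction concrete, here is a small sketch (column and table names are hypothetical): per-row transformations stream well through an in-memory ETL pipe, while joins and aggregations are usually better pushed down to the database.

```python
# Row-based transformation: each row is handled independently, so it
# streams well through an in-memory ETL pipeline (field names hypothetical).
def clean_rows(rows):
    for row in rows:
        # e.g., normalize one column of each row as it flows through the pipe
        yield {**row, "email": row["email"].strip().lower()}

rows = [{"id": 1, "email": "  A@X.COM "}, {"id": 2, "email": "b@y.com"}]
print(list(clean_rows(rows)))

# Set-based work (joins, aggregations) is usually better expressed in the
# database itself, e.g.:
#   SELECT d.name, SUM(f.amount)
#   FROM fact_sales f JOIN dim_store d ON f.store_id = d.id
#   GROUP BY d.name;
```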



The Data Warehouse, at an architecture level, is mostly about designing dimensional models for the required subject areas and adhering to principles. Other non-functional requirements, like size / volume of data, also come into play. Depicted below is a generic HR model that would help capture employee (active / left) information that could later be used for predicting employee attrition.

When moving to physical / deployment architectures, DBA skills go a long way in implementing large yet scalable and highly performant databases that host the dimensional models. Also, as a general practice and as recommended by Ralph Kimball (the father of dimensional modeling), relationships between dimensions and facts are captured in a "BUS MATRIX".


There are two other layers in the architecture above, the OLAP and Reporting layers, which will be covered next. Subsequent to that, I will try to blog about the design and development aspects of the BI solution.

What are Bw-trees? A simplified summary from a Microsoft Research paper

Azure DocumentDB utilizes Bw-trees for indexing documents, and in a schema-agnostic manner at that. While Azure DocumentDB has its own white paper on Schema-Agnostic Indexing in Azure DocumentDB, as a first step I wanted to understand Bw-trees and how they impact the performance of DocumentDB.

REFERENCE: The Bw-Tree: A B-tree for New Hardware Platforms

  • SUMMARY: The Bw-tree can be summarized as a B+ tree with a mapping table that virtualizes both the location and the size of physical pages. This mapping table looks similar to the page table structure used by operating systems to map virtual memory to physical memory (RAM). This decoupling enables latch-free data access and log-structured storage, making the Bw-tree highly performant.
    • In a multi-core server system, concurrency of thread execution is key to higher throughput. But with higher concurrency, system-level synchronization mechanisms like latches (which protect the physical consistency of data in memory) tend to become a blocking factor, impacting scalability and throughput. When a thread waits on an object, the processor preempts that thread and schedules it later, increasing context switches.
    • Additionally, in a multi-core system, caches are shared across multiple logical processors. Updating data in place causes "cache invalidation", which may require CPU cycles to fetch fresh data into the cache. This results in thread context switches, and processors spend more time context switching than performing actual work.

Bw-trees enable latch-free access to the data structure and perform delta updates (only the data that changed), minimizing context switches and thus increasing scalability and throughput.

Bw-Tree Architecture:


The Bw-tree layers include:

  • The Bw-tree layer, which sits on top of the cache layer and provides access methods (search / update) for the underlying B+ tree.
  • The cache layer: when operations occur at the Bw-tree layer, instead of performing actions directly on physical pages, the mapping table in the cache layer is used. The mapping table is a map between logical pages (PIDs) and physical pages (either on SSD or in memory).
  • The flash (storage) layer, which implements log-structured storage, enabling "delta updates".

Update: 10th Feb 2017. Let us now dive deeper into each of these individual components. We are at Section II, Bw-Tree Architecture, of the Microsoft Research paper aforementioned.

The mapping table, as we understand, is key to the Bw index, which sits atop a physical B-tree structure.

The caching layer sits between the Bw-tree layer and the storage layer. It maintains a mapping table that maps logical pages to physical pages. Logical pages are identified by unique page identifiers (PIDs), and each PID maps to a physical page. Physical pages may be stored either in memory or persisted on durable storage like SSDs / HDDs. If the physical page is on SSD, the mapping table holds the offset of the physical page; if it is in memory, an address pointer is stored. Thus the objective of the mapping table is to decouple (or loosely couple) logical pages in the Bw layer from physical pages, irrespective of storage. With this decoupled approach, the upper Bw-tree layers do not care where a physical page actually is, as they access only the logical pages (structures) recorded in the mapping table.

In addition to decoupling the links between logical and physical pages, the mapping table also holds inter-node links, both across levels (child / parent) and within a level (siblings). If nodes are linked at the logical layer, there is no requirement to maintain inter-node links at the physical level. For example, to traverse a B-tree to seek a particular record, node traversal can be done at the logical layer; on reaching the leaf node, the mapping table gives its physical location and the data is extracted from storage. Additionally, if the tree structure is modified (by CRUD operations other than Read), physical inter-node links do not need to be maintained; only the logical-layer links do.
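The mapping table and delta updates described above can be illustrated with a small conceptual sketch. This is emphatically not the real implementation (the actual Bw-tree installs deltas with an atomic compare-and-swap on the mapping table entry, which is what makes it latch-free); all names and structures here are illustrative.

```python
# Conceptual sketch of a Bw-tree mapping table: each PID maps to a chain
# of delta records ending in a base page, and an update "installs" a new
# delta by swapping the entry for that PID (atomically, in the real thing).
class MappingTable:
    def __init__(self):
        self.table = {}  # PID -> head of delta chain (or base page)

    def allocate(self, pid, base_page):
        self.table[pid] = ("base", base_page)

    def install_delta(self, pid, delta):
        # Prepend a delta record; only the changed data is written.
        self.table[pid] = ("delta", delta, self.table[pid])

    def read(self, pid):
        # Reconstruct the logical page by replaying deltas over the base.
        node, deltas = self.table[pid], []
        while node[0] == "delta":
            deltas.append(node[1])
            node = node[2]
        state = dict(node[1])
        for d in reversed(deltas):
            state.update(d)
        return state

mt = MappingTable()
mt.allocate(7, {"k1": "v1"})
mt.install_delta(7, {"k2": "v2"})   # delta update: only changed data
mt.install_delta(7, {"k1": "v1b"})
print(mt.read(7))
```

Note how no page is ever updated in place: readers follow the chain from the mapping table entry, which is why the structure avoids the cache-invalidation cost discussed earlier.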