Configuring Jupyter for public, remote access

 

As part of a data analytics course project for IIITB, a team was formed and a group project was our target. The team was spread across locations (3 in Bangalore and 1 in the US), so I thought the best option for a collaborative project was to use notebooks. Though I had worked with Jupyter Notebooks in a limited way, I had never set them up with a public-facing endpoint where any user with the right credentials can log in and work.

Below are the steps to enable Jupyter Notebooks for a shared, collaborative analytics experience.

Note: We did not use JupyterHub for a multi-user server (though ideal, we wanted a simple and fast setup for our project).

Steps:

  • Create a virtual machine using the Azure Portal (or any public cloud, or any machine with a public endpoint)
    • 4 Core, 16 GB RAM, Linux OR
    • 4 Core, 16 GB RAM, Windows 2012

Though basic R and RStudio are not required for this, we installed them nevertheless for local server testing and validation.

  • Installed RStudio.
  • Conda is a package and environment management system.
    • Using conda, multiple versions of R / Python can be run simultaneously without impacting each other's environments.
    • Conda can be installed using either Anaconda or Miniconda.
  • Installed conda via Miniconda.
  • As the default conda package set does not include an R environment, we created a new R environment using R Essentials.
  • After installation is complete, open a command prompt and run the command below to create the R environment:
    • conda install -c r r-essentials=1.5.2
    • This command takes time as it installs many R packages.
  • Test by running the command below to see whether Jupyter Notebook is properly installed:
    • jupyter notebook
  • It should open a browser at http://localhost:8888/tree, where we will be able to create an R notebook.

[Screenshot: Jupyter Notebook running at http://localhost:8888/tree]

The next steps enable the notebook server for public access.

  • Open a command prompt and type the command below to create a new configuration:
    • jupyter notebook --generate-config
    • It prints the path where the configuration file is stored.
  • Run the command below to create a password:
    • jupyter notebook password
    • As with the previous command, it returns the path where the password hash is stored; it will be in the same directory as the configuration file.
  • In the configuration file, do the following (a sketch of the resulting lines follows this list):
    • Search for c.NotebookApp.password, uncomment it and set it to the hash present in the password hash file created above.
    • Search for c.NotebookApp.ip, remove the # (uncomment it) and set it to '*' so users can connect from anywhere. As we are opening the server for public access protected only by a password, the recommendation is to set a very strong password.
    • Search for c.NotebookApp.allow_origin, remove the # and set it to '*'.
    • Search for c.NotebookApp.port and set it to 8888.
  • On the local server, enable firewall exceptions so the notebook web layer and kernel servers can communicate. We opened all ports for 127.0.0.1 for both incoming and outgoing traffic. Also, on a public cloud, Network Security Group settings have to be configured to allow traffic on port 8888.
  • Shut down and restart Jupyter Notebook, and we are set to collaborate on the data science project.
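
For reference, here is a minimal sketch of what the relevant lines in jupyter_notebook_config.py might look like after these edits (the hash value is a placeholder; copy the actual value from the generated password file):

```python
# jupyter_notebook_config.py -- relevant lines only, not the full generated file.

# Placeholder: paste the hash produced by `jupyter notebook password`
# (stored in jupyter_notebook_config.json in the same directory).
c.NotebookApp.password = 'sha1:<your-password-hash>'

# Listen on all interfaces so remote users can connect
# ('0.0.0.0' is an equivalent value in newer versions).
c.NotebookApp.ip = '*'

# Allow requests from any origin.
c.NotebookApp.allow_origin = '*'

# Serve on port 8888 (the port opened in the firewall / NSG above).
c.NotebookApp.port = 8888

# Optional: do not try to launch a browser on the server itself.
c.NotebookApp.open_browser = False
```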

While configuring, we realized that there is a lot more to conda (Anaconda / Miniconda), and that we would have to delve much deeper into how these things work internally and into the best practices one has to adhere to for a large-scale deployment. But for now, we are quite happy to have got this started in a matter of 30 minutes.


Architecting BI Solution

As part of a discussion forum, there was the below question from a peer. A very valid question, considering the options one has to weigh before building an "end-to-end BI solution".

How does the DW process work, how does data feed into the DW, how is the batch job set up to feed the DW, and who exactly creates the star schema? In one line, the end-to-end process until data is stored in the DW, and then how the ET tool pulls data from the DW. The request was also not to use Google links as answers.

Now, this question cannot be answered within that forum, so I converted it into a blog post (which also helps me with more traffic ;)). There will be a series of blog posts covering Source Systems –> Building BI Systems –> Predicting Employee Attrition, and all along, assets will be uploaded to GitHub if anyone wants to reuse them. Also, adhering to general practice, let us understand the BI solution from the architecture, design and development standpoints to appreciate the know-how at each layer. This post details "Architecting BI Solution".

Depicted below is a "Conceptual BI Architecture", the most generic representation of a BI solution on any platform and technology.

BI Conceptual Architecture

At an architectural level, below are the questions that one needs to ask.

Sourcing Data:

Regarding source data, here are a few of the questions to consider.

Access Methods:

There is a high degree of variation in how data can be extracted from source systems. Categorize applications and identify the different methodologies to pull data.

  • Home-grown applications: generally, extraction at the data layer is possible.
  • ERP software like SAP: there is NO way to extract data from the DB layer. Instead, such applications provide APIs and other access methodologies to pull data, and one needs to use only those.
  • Cloud-based SaaS solutions only have API methods to pull data.

Generally, extraction at the application / middle-tier layer is the ONLY option for third-party / external applications.

Data Formats: Different applications provide data in different formats. If accessing the DB directly it is straightforward, but generally one ends up using the different access methods mentioned above, and when extracting data from the API layer of source systems, variations in data format will arise (a small sketch of handling these follows the list below).

  • Any RDBMS structured data.
  • API Layer extraction
    • Different file formats (CSV / TSV).
    • JSON files

Data Volumes: The next aspect an architect needs to be aware of is the volume of delta / differential data.

  • Does the source system provide methods to pull delta data?
  • If so, what is the data volume (in GBs / TBs) of the daily differential extraction?

Other points to consider would be network throughput, the impact of extraction on the performance of source systems, security aspects (authorization / auditing), and the allowable staleness of data in the DWH (e.g. D – 1, i.e. data as of the previous day).

Once the architect has a clear picture of the various sources and related inputs, the next question one has to answer is whether data should be hosted in an ODS (Operational Data Store) or can be loaded directly into dimensional models (Star / Snowflake schema).

Requirements for Operational Data Stores: Operational data stores retain the same schema as the source systems, so there is no schema-level impedance between the source system (relational models) and the ODS schema. The only difference is that the ODS holds data from all source systems for a limited period of time before it is loaded into the models of the data warehouse.

  • Source System: normalized data model for that specific application.
  • ODS: normalized data models, hosted for all source systems.
  • DWH: de-normalized data models (as designed by the data modelers); a small sketch of the difference follows this list.
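
A tiny, hypothetical sketch of that difference: two normalized source tables flattened into a de-normalized dimension as the DWH would hold it (table and column names are illustrative only):

```python
# Hypothetical illustration of normalized vs. de-normalized shapes.
import pandas as pd

# Normalized, as in the source system / ODS: attributes split across tables, linked by keys.
employee   = pd.DataFrame({"emp_id": [1, 2], "name": ["Asha", "Ravi"], "dept_id": [10, 20]})
department = pd.DataFrame({"dept_id": [10, 20], "dept_name": ["HR", "Finance"]})

# De-normalized, as a DWH dimension: attributes flattened into one wide table
# so reporting queries avoid repeated joins.
dim_employee = employee.merge(department, on="dept_id").drop(columns=["dept_id"])
print(dim_employee)
```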

To understand whether an ODS is required in a BI solution, these questions need to be answered.

  • Does the current OLTP system support reporting, or are the systems so stressed (in terms of resources) that running reports will impact business transactions?
  • For running reports, does data need to be integrated with other transaction systems? For example, to understand a customer journey end to end on a site, not just the order system but also the web-log transaction systems need to be integrated. Similar is the case for taking inventory stock across various stores.

But be careful about the ODS; not many people like it, perhaps because "Data Mart", "Data Warehouse", "ODS" and "Report Data Store" are used interchangeably. But if an ODS is built, it becomes the source for the data warehouse.

 

ETL:

Data extraction, transformation and load (ETL) enables moving data from the source systems to either the ODS (if built) or the data models in the data warehouse. A generic data flow architecture for a BI solution is depicted below; notice the parts highlighted in yellow indicating the ETL layer.

Data Flow Architecture

General capability / architectural questions for ETL are listed below (but are not limited to these):

  • Support for diverse data sources, from data repositories (SQL / NoSQL) to ERP systems (SAP / PeopleSoft) to web resources (web services, RSS feeds).
  • Availability of high-performance providers for both source and destination systems (this is going to be key for the performance of the ETL layer).
  • ETL scalability is key. Generally, ETL systems, with their in-memory continuous pipeline, are good for row-based transformations but not for set-based operations. For example, if a column of a row needs to be transformed (a row-based operation), ETL is the best tool, but if data from multiple tables needs to be joined and aggregated (set-based operations), database technologies are better; see the sketch after this list.
  • Also, in a complex ETL workflow, there may be requirements for integration with messaging middleware like MQ Series, MSMQ or BizTalk. Such requirements, if any, need to be gathered. For example, SAP data may need to be extracted using middleware like BizTalk before the ETL process initiates.
  • Finally, requirements for auditing, error handling & logging, and adherence to compliance are going to be key.
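
As noted in the scalability point above, here is a small hypothetical sketch contrasting a row-based transformation handled in the pipeline with a set-based join / aggregation pushed down to the database (SQLite via pandas here; tables, columns and values are illustrative only):

```python
# Hypothetical illustration: row-based work suits the ETL pipeline,
# set-based joins / aggregations are better pushed down to the database.
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
pd.DataFrame({"order_id": [1, 2, 3], "cust_id": [10, 10, 20],
              "amount": [100.0, 250.0, 80.0]}).to_sql("orders", conn, index=False)
pd.DataFrame({"cust_id": [10, 20],
              "region": ["South", "West"]}).to_sql("customers", conn, index=False)

# Row-based transformation (one row at a time) - fine inside the ETL pipeline.
orders = pd.read_sql("SELECT * FROM orders", conn)
orders["amount_band"] = orders["amount"].apply(lambda x: "high" if x > 200 else "low")

# Set-based operation (join + aggregate) - better left to the database engine.
summary = pd.read_sql(
    """SELECT c.region, SUM(o.amount) AS total_amount
       FROM orders o JOIN customers c ON o.cust_id = c.cust_id
       GROUP BY c.region""", conn)
print(orders)
print(summary)
```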

 

DWH:

The data warehouse, at an architecture level, is more about designing dimensional models for the required subject areas and the principles adhered to. Other non-functional requirements, like the size / volume of data, also come into play. Depicted below is a generic HR model that helps capture employee (active / left) information, which could later be used for predicting employee attrition.

When moving to physical / deployment architectures, DBA skills will go a long way in implementing large yet scalable and highly performant databases that host the dimensional models. Also, as a general practice and as recommended by Ralph Kimball (father of the dimensional model), relationships between dimensions and facts are captured in a "bus matrix" (a small hypothetical example follows the image below).

[Image: HR dimensional model / bus matrix]
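
A small hypothetical example of what such a bus matrix could look like for an HR subject area (fact and dimension names are illustrative only):

```python
# Hypothetical bus matrix: rows are business processes (fact tables),
# columns are conformed dimensions; True marks participation.
import pandas as pd

bus_matrix = pd.DataFrame(
    {"DimEmployee":   [True, True],
     "DimDepartment": [True, False],
     "DimDate":       [True, True]},
    index=["FactHeadcount", "FactAttrition"])
print(bus_matrix)
```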

There are two other layers in the architecture above, OLAP and the reporting layer, which will be covered next. Subsequent to that, I will try to blog about the design and development aspects of the BI solution.