As part of discussion forum, there was below question from my peer. A very valid question considering options one has to consider before building an “End to End BI solution”.
How DW process works, how data feeds to the DW, how batch job set up to feed to DW, who exactly creates star schema. In one line end-end process till data stores in DW. Then how ET tool pulls data from DW. And request was not to USE GOOGLE LINKS as answers.
Now this questions can not be answered in that forum.. so converted to a blog post (and also help me with more traffic ). There are series of blog posts that cover from Source Systems –> Building BI Systems –> Predicting Employee Attrition. And all along assets will be uploaded to github if anyone wants to reuse. Also, adhering to general practice, let us understand BI solution from architectural , design and development stand point to appreciate knowhow at each layer. This post details “Architecting BI Solution”.
Below depicted is “Conceptual BI architecture” that is most generic representation of BI solution on any platform and technology.
At an architectural level, below are questions that one need to ask.
Regarding source data few of questions
High degree of variation in terms of how data can be extracted from source systems. Categorize application and identify different methodologies to pull data.
- Home grown application, generally extraction at data layer is possible.
- ERP software like SAP, there is NO way to extract data from DB layer. Instead such application provide API and other access methodologies to pull data. One need to use only those to pull data.
- Cloud based SAAS solution only have API methods to pull data.
Generally extraction at application / middle tier layer is ONLY option for Third Party / External applications.
Data Formats: Different applications provide different data formats. If accessing DB directly it becomes straight forward but generally one ends up using different access methods as afore mentioned. When extracting data from API layer from source systems variations in data format will arise.
- Any RDBMS structured data.
- API Layer extraction
- Different file formats (CSV / TSV).
- JSON files
Data Volumes: Next aspect architect needs to be aware, is volume of delta / differential data.
- Does source system provide methods to pull delta data?
- If so, data volume (in terms of GBs / TBs) of differential daily extraction.
Other points to consider would be, Network Throughput, impact of extraction on performance of source systems, Security aspects (authorization / auditing), allowable staleness of data in DWH ( D – 1).
Once architect has clear picture of various sources and related inputs, next question one has to answer is should data be hosted in a ODS (Operational Data Store) or can it be directly loaded into dimensional models (Start / Snowflake Schema).
Requirements for Operational Data Stores: Operational data stores retain same schema as source systems and there is no schema level impedance between Source System (Relational Models) and ODS Schema. Only difference is ODS holds data from all source systems for a limited period of them before loaded into models of data warehouse.
- Source System: Normalized to store data model for that specific application.
- ODS: Normalized data models, hosted for all source systems.
- DWH: De-Normalized data models (as designed by data models).
To understand if ODS is required in a BI solution, these questions needs to be answered.
- Does current OLTP system support reporting or are systems too stressed (in terms for resources) that running reports will impact business transactions.
- For running reports, does data need to be integrated with other transaction systems. Like for example to understand a customer journey end to end on a site, not just Order System but Web Log transaction systems need to integrated to understand customer journey. Similar is case of taking inventory stock across various stores..
But be careful about ODS, not many people like it and may be because “Data Mart”, “Data Warehouse”, “ODS”, “Report Data Store” are used interchangeably. But if ODS is built, it becomes source for Data Warehouse.
Data extraction , transformation and load enables to move data from Source Systems to either ODS (if built) or to data models in Data Warehouse. A generic architecture data flow architecture for a BI solution depicted below. Notice those highlighted in yellow indicating ETL layer.
General capability / architectural questions for ETL listed, but not limited to below
- Support for diverse data sources (from Data Repositories (SQL / NOSQL) to ERP (SAP / PeopleSoft..), Web Resources (Web Services, RSS Feeds).
- Availability of high performance provider both for source and destination systems. (This is going to key for performance of ETL layer).
- ETL Scalability is key, generally ETL systems due to in memory continuous pipe are good for row based transformation but not good for Set based operations. For example if a column of a row (row based operations) needs to be transformed, ETL is best tool but if data from multiple tables need to joined, aggregated (set based operations) database technologies are better.
- Also in a complex ETL workflow, there may be requirements for integration with messaging middleware like MQ Series or MSMQ or Biz Talk. Such requirements if any needs to be gathered. For example, SAP data may need to be extracted using middleware like Biz Talk and then ETL process will initiate.
- Finally requirements for Auditing, error handling & logging, adherence to compliances are going to key.
Data Warehouse at an architecture level is more aligned towards designing dimension models for required subject areas and adhered principles. Other non functional requirements like Size / volume of data etc come into play. Below depicted is a generic HR model that would help capture employee (Active / Left) information that could be later used for predicting employee attrition.
When moving to physical / deployment architectures, DBA skills will take a long way in implementing a large yet scalable and highly performant databases that host dimensional models. Also as a general practice and recommendation by Ralph Kimball (Father of dimensional model), relationships between Dimensions and facts are captured in a “BUS MATRIX”.
There are two other layers in architecture above, OLAP and Reporting Layer that will be covered next. Subsequent to that I will try to blog design and development aspects of BI solution.