“DAY 3: LEARN 3 THINGS A DAY – LEARNING 2”

Data Vault Series 2 – Data Vault Components

http://tdan.com/data-vault-series-2-data-vault-components/5155

In the DV model, the tasks of data transformation, cleansing and de-duplication are moved from the Source-to-DWH (Data Warehouse) stage to the DWH-to-DM (Data Mart) stage. Thus the effort needed to move data from disparate source systems into a centralized DWH model becomes simpler. As I understand it so far:

 

  • Using DV modeling simplifies and expedites data modeling and the loading of data from the source systems into the DW model.
  • Keep the transformation to dimensional models for the data marts simple, preferably by using views (for faster delivery of the dimensional model), unless performance becomes an issue (a view-based sketch appears after the DV table scripts below).

As detailed in the previous learning post, the main entities of DV modeling are:

  • Hubs
  • Links
  • Satellites

For this learning I am using the AdventureWorks database. I picked a few tables, with simplified scripts below.

CREATE TABLE [ProductCategory](
    [ProductCategoryID] [int] IDENTITY(1,1) NOT NULL,
    [Name] [dbo].[Name] NOT NULL,
     CONSTRAINT [PK_ProductCategory_ProductCategoryID] PRIMARY KEY CLUSTERED
    (
        [ProductCategoryID] ASC
    )
)

CREATE TABLE [ProductSubcategory](
    [ProductSubcategoryID] [int] IDENTITY(1,1) NOT NULL,
    [ProductCategoryID] [int] NOT NULL,
    [Name] varchar(100) NOT NULL,
    CONSTRAINT [PK_ProductSubcategory_ProductSubcategoryID] PRIMARY KEY CLUSTERED
    (
        [ProductSubcategoryID] ASC
    )
)

CREATE TABLE [Product](
    [ProductID] [int] IDENTITY(1,1) NOT NULL,
    [Name] nvarchar(100) NOT NULL,
    [ProductNumber] [nvarchar](25) NOT NULL,
    [Color] [nvarchar](15) NULL,
    [SafetyStockLevel] [smallint] NOT NULL,
    [ReorderPoint] [smallint] NOT NULL,
    [StandardCost] [money] NOT NULL,
    [ListPrice] [money] NOT NULL,
    [Size] [nvarchar](5) NULL,
    [SizeUnitMeasureCode] [nchar](3) NULL,
    [WeightUnitMeasureCode] [nchar](3) NULL,
    [Weight] [decimal](8, 2) NULL,
    [DaysToManufacture] [int] NOT NULL,
    [ProductLine] [nchar](2) NULL,
    [Class] [nchar](2) NULL,
    [Style] [nchar](2) NULL,
    [ProductSubcategoryID] [int] NULL,
    [ProductModelID] [int] NULL,
    [SellStartDate] [datetime] NOT NULL,
    [SellEndDate] [datetime] NULL,
    [DiscontinuedDate] [datetime] NULL,
    CONSTRAINT [PK_Product_ProductID] PRIMARY KEY CLUSTERED
    (
        [ProductID] ASC
    )
)

CREATE TABLE [SalesOrderHeader](
    [SalesOrderID] [int] IDENTITY(1,1) NOT NULL,
    [OrderDate] [datetime] NOT NULL ,
    [SalesOrderNumber] [nvarchar](25) NOT NULL, -- data type assumed (missing in the original script)
    [CustomerID] [int] NOT NULL,
    [SalesPersonID] [int] NULL,
    [TerritoryID] [int] NULL,
    [BillToAddressID] [int] NOT NULL,
    [ShipToAddressID] [int] NOT NULL,
    [ShipMethodID] [int] NOT NULL,
    [CreditCardID] [int] NULL,
    [CreditCardApprovalCode] [varchar](15) NULL,
    [CurrencyRateID] [int] NULL,
    [SubTotal] [money] NOT NULL,
    [TaxAmt] [money] NOT NULL ,
    [Freight] [money] NOT NULL ,
    [TotalDue]  AS (isnull(([SubTotal]+[TaxAmt])+[Freight],(0))),
    [Comment] [nvarchar](128) NULL,
    CONSTRAINT [PK_SalesOrderHeader_SalesOrderID] PRIMARY KEY CLUSTERED
    (
        [SalesOrderID] ASC
    )
)

CREATE TABLE [SalesOrderDetail](
    [SalesOrderID] [int] NOT NULL,
    [SalesOrderDetailID] [int] IDENTITY(1,1) NOT NULL,
    [CarrierTrackingNumber] [nvarchar](25) NULL,
    [OrderQty] [smallint] NOT NULL,
    [ProductID] [int] NOT NULL,
    [SpecialOfferID] [int] NOT NULL,
    [UnitPrice] [money] NOT NULL,
    [UnitPriceDiscount] [money],
    [LineTotal]  AS (isnull(([UnitPrice]*((1.0)-[UnitPriceDiscount]))*[OrderQty],(0.0))),
     CONSTRAINT [PK_SalesOrderDetail_SalesOrderID_SalesOrderDetailID] PRIMARY KEY CLUSTERED
    (
        [SalesOrderID] ASC,
        [SalesOrderDetailID] ASC
    )
)

From the above tables, the hubs, links and satellites are as below:

  • Hubs (HProduct, HProductSubCategory, HProductCategory)
  • Links (LProducts, LOrderHeader, LOrderDetail)
  • Satellites (SHProduct, SHProductSubCategory, SHProductCategory, SLOrderHeader, SLOrderDetail)

Create Table [HProductCategory]
(
    HProductCategoryKey int identity(1,1) NOT NULL constraint PK_HProductCategory primary key clustered,
    ProductCategoryID int not null,
    LoadDate datetime not null,
    RecordSource varchar(100)
)
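
As a minimal sketch of how such a hub might be loaded (assuming the source ProductCategory table above, a batch-level load timestamp and an illustrative RecordSource value), only business keys not yet present are inserted:

-- Hypothetical hub load: insert only business keys not already in the hub
DECLARE @LoadDate datetime = GETDATE();

INSERT INTO HProductCategory (ProductCategoryID, LoadDate, RecordSource)
SELECT s.ProductCategoryID, @LoadDate, 'AdventureWorks.ProductCategory'
FROM ProductCategory AS s
WHERE NOT EXISTS
(
    SELECT 1
    FROM HProductCategory AS h
    WHERE h.ProductCategoryID = s.ProductCategoryID
);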

Create Table [SHProductCategory]
(
    HProductCategoryKey int not null,
    LoadDate datetime not null,
    [Name] nvarchar(100) not null,
    constraint PK_SHProductCategory Primary Key Clustered
    (
        HProductCategoryKey,
        LoadDate
    )
)
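
Similarly, a satellite row would typically be inserted only when the descriptive attributes change. A sketch, assuming the hub has been loaded first and that the latest satellite row per hub key represents the current state:

-- Hypothetical satellite load: add a new row only when the Name has changed
DECLARE @LoadDate datetime = GETDATE();

INSERT INTO SHProductCategory (HProductCategoryKey, LoadDate, [Name])
SELECT h.HProductCategoryKey, @LoadDate, s.[Name]
FROM ProductCategory AS s
JOIN HProductCategory AS h
    ON h.ProductCategoryID = s.ProductCategoryID
OUTER APPLY
(
    -- latest satellite row for this hub key, if any
    SELECT TOP (1) sat.[Name]
    FROM SHProductCategory AS sat
    WHERE sat.HProductCategoryKey = h.HProductCategoryKey
    ORDER BY sat.LoadDate DESC
) AS cur
WHERE cur.[Name] IS NULL OR cur.[Name] <> s.[Name];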

Create Table HProductSubCategory
(
    HProductSubCategoryKey int identity(1,1) NOT NULL constraint PK_HProductSubCategory primary key clustered,
    ProductSubCategoryID int NOT NULL,
    LoadDate datetime not null,
    RecordSource varchar(100)
)

Create Table SHProductSubCategory
(
    HProductSubCategoryKey int not null,
    LoadDate datetime not null,
    Name varchar(100),
    constraint PK_SHProductSubCategory primary key clustered
    (
        HProductSubCategoryKey,
        LoadDate
    )
)

Create table HProduct
(
    HProductKey int identity(1,1) not null constraint PK_HProductKey primary key clustered,
    [ProductID] [int] NOT NULL,
    LoadDate datetime not null,
    RecordSource varchar(100)
)

Create table SHProduct
(
    HProductKey int not null,
    LoadDate datetime not null,
    [Name] nvarchar(100) NOT NULL,
    [ProductNumber] [nvarchar](25) NOT NULL,
    [Color] [nvarchar](15) NULL,
    [SafetyStockLevel] [smallint] NOT NULL,
    [ReorderPoint] [smallint] NOT NULL,
    [StandardCost] [money] NOT NULL,
    [ListPrice] [money] NOT NULL,
    [Size] [nvarchar](5) NULL,
    [SizeUnitMeasureCode] [nchar](3) NULL,
    [WeightUnitMeasureCode] [nchar](3) NULL,
    [Weight] [decimal](8, 2) NULL,
    [DaysToManufacture] [int] NOT NULL,
    [ProductLine] [nchar](2) NULL,
    [Class] [nchar](2) NULL,
    [Style] [nchar](2) NULL,
     [SellStartDate] [datetime] NOT NULL,
    [SellEndDate] [datetime] NULL,
    [DiscontinuedDate] [datetime] NULL,
    Constraint [PK_SHProductKey] Primary Key Clustered
    (
        HProductKey,
        LoadDate
    )
)

 

Create Table LProducts
(
    LProductsKey int identity(1,1) not null constraint PK_LProducts Primary key Clustered,
    HProductKey int not null,
    HProductSubCategoryKey int not null,
    HProductCategoryKey int not null,
    LoadDate datetime not null,
    RecordSource varchar(100) not null
)
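
A link load would resolve the source foreign keys to hub surrogate keys and insert only combinations not seen before. Again a sketch, assuming the Product, ProductSubcategory and ProductCategory source tables above and an illustrative RecordSource value:

-- Hypothetical link load: resolve business keys to hub keys, insert new combinations only
DECLARE @LoadDate datetime = GETDATE();

INSERT INTO LProducts (HProductKey, HProductSubCategoryKey, HProductCategoryKey, LoadDate, RecordSource)
SELECT hp.HProductKey, hsc.HProductSubCategoryKey, hc.HProductCategoryKey, @LoadDate, 'AdventureWorks'
FROM Product AS p
JOIN ProductSubcategory AS psc ON psc.ProductSubcategoryID = p.ProductSubcategoryID
JOIN HProduct AS hp             ON hp.ProductID = p.ProductID
JOIN HProductSubCategory AS hsc ON hsc.ProductSubCategoryID = psc.ProductSubcategoryID
JOIN HProductCategory AS hc     ON hc.ProductCategoryID = psc.ProductCategoryID
WHERE NOT EXISTS
(
    SELECT 1
    FROM LProducts AS l
    WHERE l.HProductKey = hp.HProductKey
      AND l.HProductSubCategoryKey = hsc.HProductSubCategoryKey
      AND l.HProductCategoryKey = hc.HProductCategoryKey
);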

Create table LOrderHeader
(
    LOrderHeaderKey int identity(1,1) not null constraint pk_LOrderHeaderKey Primary key Clustered,
    [SalesOrderID] int not null,
    [CustomerID] [int] NOT NULL,
    [SalesPersonID] [int] NULL,
    [TerritoryID] [int] NULL,
    [BillToAddressID] [int] NOT NULL,
    [ShipToAddressID] [int] NOT NULL,
    [ShipMethodID] [int] NOT NULL,
    [CreditCardID] [int] NULL,
    [CurrencyRateID] [int] NULL
)

Create table SLOrderHeader
(
    LOrderHeaderKey int not null,
    [OrderDate] [datetime] NOT NULL ,
    [SalesOrderNumber] [nvarchar](25) NOT NULL, -- data type assumed (missing in the original script)
    [CreditCardApprovalCode] [varchar](15) NULL,
    [SubTotal] [money] NOT NULL,
    [TaxAmt] [money] NOT NULL ,
    [Freight] [money] NOT NULL ,
    [TotalDue]  AS (isnull(([SubTotal]+[TaxAmt])+[Freight],(0))),
    [Comment] [nvarchar](128) NULL
)

Create table LOrderDetail
(
    LOrderDetailKey int identity(1,1) constraint PK_OrderDetailKey primary key clustered,
    LOrderHeaderKey int not null,
    [SalesOrderDetailID] [int]  NOT NULL,
    LProductsKey int not null,
    [SpecialOfferID] [int] NOT NULL
)

Create table SLOrderDetail
(
    LOrderDetailKey int not null,
    [CarrierTrackingNumber] [nvarchar](25) NULL,
    [OrderQty] [smallint] NOT NULL,
    [UnitPrice] [money] NOT NULL,
    [UnitPriceDiscount] [money],
    [LineTotal]  AS (isnull(([UnitPrice]*((1.0)-[UnitPriceDiscount]))*[OrderQty],(0.0)))
)
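
As mentioned at the start, the dimensional layer for the data marts can initially be virtual. A hedged sketch of such a view, assuming the latest satellite row per hub key represents the current state (the view name and column selection are illustrative):

-- Hypothetical virtual dimension: current Product attributes from hub + satellite
CREATE VIEW DimProduct
AS
SELECT h.HProductKey,
       h.ProductID,
       s.[Name],
       s.ProductNumber,
       s.Color,
       s.ListPrice
FROM HProduct AS h
JOIN SHProduct AS s
    ON s.HProductKey = h.HProductKey
   AND s.LoadDate = (SELECT MAX(LoadDate)
                     FROM SHProduct
                     WHERE HProductKey = h.HProductKey);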

This completes my first attempt at designing a DV DWH model from the 3NF schema of an OLTP system. During the design a few questions cropped up; these are ignored for now, and the intent is to address and understand them in the next couple of posts. A few of the questions are below:

  • If there is a parent–child relationship between tables, in DV it can be flattened in two ways:
    • include the parent attributes in the Hub, or
    • create a link table that captures the relationship between parent and child.
      • When doing so, do we need an additional link table, or will the link table already created for transactions suffice?
      • What are the pros and cons of a new link table vs leveraging the existing link table?
  • When an attribute is present in a transaction table, what decides whether the attribute is placed in the link or in the satellite table?

A few such questions remain to ponder and learn about; I also need to present the above SQL scripts in a more presentable, visual format.

Until my next learning

Abhyast


“Day 3: Learn 3 Things A Day – Learning 1”

What do I want from DWH architecture?

Reading about Data Vault modeling led me to think about what we actually require from DWH modeling / architecture. These requirements may or may not be practical, but nevertheless I would like to understand for myself what exactly is needed from a data warehouse architecture. These are also non-functional requirements that are generic and can be applied to all verticals.

Prior to detailing the requirements, let me define the problem space:

  • The DWH should be the place where enterprise data is consolidated.
  • Data is generated by LOB applications.
  • Existing applications are modified to meet new business changes and requirements.
  • A few applications are retired and, at the same time, new applications are introduced.
  • Business partners have different application landscapes that organizations will need to integrate.
  • With new technologies the possible rate of change has increased, and the modeling should handle it.
  • All of the above can be summarized as: a new data warehouse model should handle
    • Volume (V): high volume of data
    • Variety (V): data from disparate data sources (both internal and external applications)
    • Velocity (V): a high rate of data flow
  • In addition to the above, a data warehouse architecture should be able to absorb changes to the structure (format) of source data at a high rate.

Requirements can be split into two categories.

  1. Modeling requirements
  2. Data movement requirements

Modeling Requirements:

  • Encompassing & integrated:
    • All possible sources of data should be incorporated
    • Include attributes for every table from every LOB application
    • Data from every source application should be merged (either as-is or de-duplicated)
    • Integrate data from third-party applications, both internal and external to the organization
  • Flexible:
    • The DWH model should withstand application changes
      • Applications may have features enabled or disabled (with flags), resulting in different data being generated.
      • Existing features may be modified or new features added, again resulting in changes to the data format.
      • Applications may be retired, so new data is no longer generated, but the data collected until then still needs to be retained in the DWH.
  • Extendable: the DWH model should be extendable so that new entities can be added
    • New applications may be introduced that bring
      • New masters (contexts to slice transactions in new ways)
      • New transactions
    • All of these should be usable in conjunction with existing data.
  • Simplistic & performant:
    • Should be simple for end users to perform ad hoc analysis and hypothesis testing
    • Should be highly performant (which promotes ad hoc analysis)
  • Traceability:
    • Data flow from the source should be traceable and auditable

Data movement requirements:

Data from the source systems needs to be moved to the DWH and loaded into a set of tables in the DWH model.

  • Considering the volume, velocity and variety of data generated by the source systems, the data flow architecture should be scalable, preferably horizontally (scale-out rather than scale-up).
  • Additionally, for a few data sets (like comments) there could be a need to aggregate data during data movement before persisting it in the DWH model.
  • Development of data movement packages should go in tandem with modeling changes. And, similar to the modeling requirement, package development should be in an append-only mode, with new packages added instead of existing packages modified.

DU = Deployable Unit

A deployable unit should include

  • Model changes (DB Scripts)
  • Packages for movement of data

In summary, the architecture should deliver high performance and enable deployment of independent DUs (deployable units of BI components) which can be turned on or off as required, with built-in features like traceability, audit and history.

Admittedly, some of these requirements may be idealistic and not practical, but I would like to ideate on this further, especially the “Deployable Unit” in the DWH.

“Learn 3 things A Day – Learning 3”

Sentinel: Automatic failover system in Redis

High Availability options in Redis include “Clustering” and “Master–Slave” topologies. Clustering is an extension of multiple “Master–Slave” sets, where data is partitioned into mutually exclusive data sets, each stored in a dedicated “Master–Slave” set.

In Redis, the tasks of monitoring, failure detection and failover (detailed below) are performed by a separate process, “Sentinel”. Sentinels can run on the same server as Redis or on independent servers.

Master–Slave Topology: In the Master–Slave topology, applications connect to the master Redis instance to perform read–write operations. Data thus generated is asynchronously replicated to the slave Redis instance. To persist data to disk, Redis provides features like snapshots (RDB) and AOF (Append Only File). Such persistence to disk is primarily for disaster recovery purposes, though it could also support High Availability.
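
For reference, a minimal sketch of the relevant configuration on the slave, as I understand the Redis config directives (the IP, port and snapshot thresholds are illustrative; slaveof is named replicaof in newer Redis versions):

# redis.conf on the slave instance – replicate from the master (IP/port illustrative)
slaveof 192.168.1.10 6379

# Optional disk persistence (on either node)
# AOF: log every write operation
appendonly yes
# RDB snapshot: after 900 seconds if at least 1 key changed
save 900 1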

As shown below, data generated by multiple clients is written to the Master and ultimately (due to the asynchronous nature) gets replicated to the Slave. Sentinel processes (there can be more than one) can be configured to monitor multiple master sets.

  • M1…Mn = Masters
  • S1…Sn = Slaves
  • C1..Cn = Sentinels

[Image: Master–Slave sets (M1…Mn, S1…Sn) monitored by Sentinels (C1…Cn)]

Redis Clustering: Redis clustering extends multiple Master–Slave sets so that each set stores a portion of the data. A simple clustering topology is depicted below. Notice that in the topology below the data is split into multiple disjoint sets, and each set is stored in a different Master–Slave set.

[Image: Redis clustering – data partitioned across multiple Master–Slave sets]

As per Wikipedia (http://en.wikipedia.org/wiki/High_availability), the three principles of high availability engineering are

  • Elimination of Single point of failure
    • This means adding redundancy to the system so that failure of a component does not mean failure of the entire system.
  • Reliable crossover
    • In multithreaded systems, the crossover point itself tends to become a single point of failure. High availability engineering must provide for reliable crossover.
  • Detection of failures as they occur
    • If the two principles above are observed, then a user may never see a failure. But the maintenance activity must.

More practical requirements for a system to be highly available:

  • Monitoring must be enabled
  • Failure detection: detect failures or perceived-failure scenarios
  • Notification and action for automatic correction of the situation (generally a failover)

Example of High Availability in SQL Server on Windows Clustering:

  • Monitoring: the SQL resource executable installed as part of the cluster installation monitors the health of the SQL Server instance.
    • In earlier versions of SQL Server it performs IsAlive and LooksAlive checks.
  • Failure detection: if multiple IsAlive and LooksAlive checks fail, it is considered a SQL Server instance failure.
  • Decisive action: fail over SQL Server from the active to the passive Windows node.

Nature of failures that systems have to withstand:

  • Executable crash
    • Redis may fail (improper exception handling??)
    • Sentinel may crash
  • System crash due to H/W or OS-related issues.
    • In such a scenario, the Redis instances and any monitoring Sentinels present on that machine will fail.
  • Network disruptions
    • A set of systems on the same network may become unavailable (even though the Redis and Sentinel instances may be working).
    • Network disruptions (also termed network partitions) can span multiple networks.

Sentinels should detect failures of Redis and take appropriate actions.

How does Sentinel monitor and failover Redis instance?

  • Sentinel instances ping the master Redis instance and query its port. Failure to contact / query the master Redis instance is treated as a failure.
  • To avoid false positives, multiple Sentinels monitor the same master. The Sentinels must agree (the quorum) to declare that the master has failed and to trigger a failover.
  • But prior to the failover, a majority of the Sentinels must be reachable (a minimal configuration sketch is below).
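
A minimal Sentinel configuration sketch for this setup, assuming a master named mymaster on the default port (all values illustrative); the quorum of 2 means two Sentinels must agree the master is down:

# sentinel.conf (values illustrative)
# monitor a master called "mymaster"; quorum of 2 Sentinels to declare it down
sentinel monitor mymaster 192.168.1.10 6379 2
# consider the master unreachable after 30 seconds without a valid reply
sentinel down-after-milliseconds mymaster 30000
# maximum time allowed for a failover before it is considered failed
sentinel failover-timeout mymaster 180000
# how many slaves may resync with the new master at the same time
sentinel parallel-syncs mymaster 1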

This is just the tip of the iceberg on Redis failover. I need to learn more.

–Abhyast

“Learn 3 things A Day – Learning 2”

I have tried so many ways to kick the lethargy out of me and force myself to blog. This is one more thought that I had: every day I will think about / learn / understand about 5 new things and blog about them. Let's see whether I can turn this initiative into a habit.

Mathematical thought: “why, in a distributed system, there will almost always be a failure of at least one system, yet only a very remote chance of the entire system failing”

It has always intrigued me when documentation / presentations mention “In a highly distributed system there are always one or more failed systems somewhere, but the entire solution should withstand such failures”. Today I read it again in the Redis documentation, so instead of going through the document, I digressed on what it meant and finally came up with the answer below :).

 

Assume an unreliable system that keeps failing frequently. Now someone asks the question “Will it fail tomorrow?”. The answer would be “probably, but not sure”. Mathematically, let us model this as a 50% (1/2) chance that the system will fail and the same chance that it will not fail.

P(Failure) = 1/2 and P(Not Failure) = 1/2

To increase reliability we add one more system with the same reliability, i.e. for the second system as well

P(Failure) = 1/2 and P(Not Failure) = 1/2

 

Assuming these two systems are independent of each other:

  • Probability that both systems fail at the same time = P(Failure(A)) × P(Failure(B)) = 1/2 × 1/2 = 1/4 = 25%.
  • Probability that at least one of the systems fails = 1 − P(neither fails) = 1 − (1/2 × 1/2) = 3/4 = 75% (the probabilities cannot simply be added, since both failing at once is possible).

 

Now assume we have 10 such unreliable systems; then

  • Probability that all systems fail at the same time = 1/(2^10) ≈ 0.1%, i.e. the system as a whole would be available 99.9% of the time.
  • Probability of at least one system failing = 1 − 1/(2^10) ≈ 99.9%, which implies that at almost any point in time at least one system is failing somewhere (the general form is below).
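
In general, assuming each of the n systems fails independently with the same probability p, the two quantities above are:

    P(\text{all } n \text{ fail}) = p^{n}, \qquad P(\text{at least one fails}) = 1 - (1 - p)^{n}

With p = 1/2 and n = 10 these give 1/1024 ≈ 0.1% and 1 − 1/1024 ≈ 99.9% respectively, matching the numbers above.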

 

Admittedly this is a very novice understanding, and there could be more complexities involved in real-world computations (like conditional probability), but for now this at least suffices as an answer to the question that always left me wondering.

 

Maybe I can expand this further: if we take x as the (predicted) reliability of the individual systems and specify a required reliability for the entire system, such a computation could perhaps be reverse-engineered into some meaningful sizing calculations.
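
As a first stab at that reverse calculation, under the same independence assumption and treating the whole system as unavailable only when all n copies fail: if each system fails with probability p and the target availability is A, then we need

    p^{n} \le 1 - A \;\Longrightarrow\; n \ge \frac{\ln(1 - A)}{\ln p}

For example, with p = 1/2 and A = 99.99%, n ≥ ln(0.0001)/ln(0.5) ≈ 13.3, i.e. 14 such unreliable systems would be needed.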

 

Till my next learning…

Abhyast

“Learn 3 things a Day – Learning 1”

I have tried so many ways to kick the lethargy out of me and force myself to blog. This is one more thought that I had: every day I will think about / learn / understand about 5 new things and blog about them. Let's see whether I can turn this initiative into a habit.

Data Vault (DV) modeling:

Start with Disclaimer 🙂 :

For the last week (inclusive of today) I have read and tried to understand “Data Vault” modeling, though admittedly I have not experienced it enough to comment. I have always believed that learning / understanding has multiple degrees, ranging from “Read –> Understand –> Explain –> Experience”, with “Read” and “Experience” being the lowest and highest levels of learning. With regards to DV modeling, I DO NOT have enough implementation experience to give a fair comparison and comment. Treat the below as my learning and not any passionate wars :).

Objective of Data Vault (DV) modeling: Based on readings from multiple blogs,

  • DV modeling tries to retain data in its as-is form from the source systems, avoiding costly and time-consuming conversions when loading data into the DWH data model.
    • Loading into a Star Schema includes format (structure) conversions and data quality checks that delay the entire process (design of the data models and loading of data into the dimension models).
  • Detail oriented: it keeps the same grain as the source systems, thus avoiding “on the fly” aggregation in the data model.
    • On systems with a high velocity of data flow (like telecom or weblogs), the general practice (though optional) is to aggregate data to higher levels to aid reporting (but losing the detail-level data permanently).
  • Agile (both in modeling and loading).
    • Minimal transformation of data format from source
    • Agile BI vs Waterfall BI
  • Ability to Audit and track changes

Basics of DV modeling: Data Vault modeling consists of the following (a bare-bones sketch follows this list)

  • Hub: Business Keys + Surrogate Keys + Source Systems & Timestamps
  • Links: Capture relationships between Hubs
  • Satellites: Capture context of Hubs and Links.
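
A bare-bones sketch of the three structures, with purely illustrative entity and column names (not from AdventureWorks), just to make the shapes concrete:

-- Hub: business key + surrogate key + load metadata
Create Table HCustomer
(
    HCustomerKey int identity(1,1) not null primary key,
    CustomerID int not null,            -- business key from the source system
    LoadDate datetime not null,
    RecordSource varchar(100) not null
)

-- Link: relationship between hubs (here, a hypothetical customer-to-order relationship)
Create Table LCustomerOrder
(
    LCustomerOrderKey int identity(1,1) not null primary key,
    HCustomerKey int not null,
    HOrderKey int not null,             -- key of another (hypothetical) hub
    LoadDate datetime not null,
    RecordSource varchar(100) not null
)

-- Satellite: descriptive context of a hub (or link), historized by LoadDate
Create Table SHCustomer
(
    HCustomerKey int not null,
    LoadDate datetime not null,
    CustomerName nvarchar(100) null,
    RecordSource varchar(100) not null,
    primary key (HCustomerKey, LoadDate)
)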

Comparing with the Dimension Model:

  • Dimension ≈ Hub + Satellite
  • Fact ≈ Link + Satellite (+ other Links if needed)

The DV model sits best between the 3NF of the source systems and the Star Schemas (2NF dimensions and 3NF facts of the dimensional model). IMHO, transformation from the source systems (3NF) to the DV model should be easier, as there are minimal lookups, transformations and conversions. But the DV model cannot be exposed directly to consumers of the DWH service, so a dimensional model (physical or virtual) needs to be built on top: physical means converting the DV into a dimensional model, virtual means creating views on top of the DV model. The recommendation is to create a virtual data model (using views), as this aligns with the agile development goal of DV modeling.

So, if the dimensional model is the model for reporting, the DV model is the middle-layer model consuming data from a backend model that is closer to 3NF. Microsoft SQL Server Analysis Services internally maintains KStore, RStore and HStore structures; the KStore is analogous to a HUB and the RStore is analogous to a LINK, but in SSAS these are maintained for the attributes within a dimension, whereas in DV they span dimensions.

At a high level, this is what I understood about how DV stands with respect to other DWH models. I am still in the midst of reading, understanding and internalizing the various articles listed below.

 

  • Some lingering questions that I have:
    • How does DV deal with SCD type 2, Type 3 dimensions?
    • How does DV modeling deal with different fact types (Transactional, Snapshot etc)?
    • Is it Agile in true sense and can model be developed iteratively?
    • Is the Data Vault model suited to new-age companies with MPP and Hadoop architectures?
    • What are pitfalls with DV models that can be envisaged and avoided?

Need to learn and think more… moving on to the 2nd learning / thought of the day.

Happy Learning………….

–Abhyast