“DAY 3: LEARN 3 THINGS A DAY – LEARNING 2”

Data Vault Series 2 – Data Vault Components

http://tdan.com/data-vault-series-2-data-vault-components/5155

In the DV model, the task of data transformation, cleansing, and de-duplication is moved from the Source-to-DWH (Data Warehouse) stage to the DWH-to-DM (Data Mart) stage. The effort needed to move data from disparate source systems into a centralized DWH model thus becomes simpler. As I understand it so far:


  • DV modeling simplifies and expedites data modeling and the loading of data from source systems into the DW model.
  • Dimensional models for the data marts can then be modeled simply on top of the DV, preferably using views for faster delivery of the dimensional model (see the DimProduct view sketched after the DV tables below), unless performance becomes an issue.

As detailed in the previous learning post, the main entities of DV modeling are

  • Hubs
  • Links
  • Satellites

For this learning I am using the AdventureWorks database. I picked a few tables; their (simplified) scripts are below.

CREATE TABLE [ProductCategory](
    [ProductCategoryID] [int] IDENTITY(1,1) NOT NULL,
    [Name] [nvarchar](100) NOT NULL,
     CONSTRAINT [PK_ProductCategory_ProductCategoryID] PRIMARY KEY CLUSTERED
    (
        [ProductCategoryID] ASC
    )
)

CREATE TABLE [ProductSubcategory](
    [ProductSubcategoryID] [int] IDENTITY(1,1) NOT NULL,
    [ProductCategoryID] [int] NOT NULL,
    [Name] varchar(100) NOT NULL,
    CONSTRAINT [PK_ProductSubcategory_ProductSubcategoryID] PRIMARY KEY CLUSTERED
    (
        [ProductSubcategoryID] ASC
    )
)

CREATE TABLE [Product](
    [ProductID] [int] IDENTITY(1,1) NOT NULL,
    [Name] nvarchar(100) NOT NULL,
    [ProductNumber] [nvarchar](25) NOT NULL,
    [Color] [nvarchar](15) NULL,
    [SafetyStockLevel] [smallint] NOT NULL,
    [ReorderPoint] [smallint] NOT NULL,
    [StandardCost] [money] NOT NULL,
    [ListPrice] [money] NOT NULL,
    [Size] [nvarchar](5) NULL,
    [SizeUnitMeasureCode] [nchar](3) NULL,
    [WeightUnitMeasureCode] [nchar](3) NULL,
    [Weight] [decimal](8, 2) NULL,
    [DaysToManufacture] [int] NOT NULL,
    [ProductLine] [nchar](2) NULL,
    [Class] [nchar](2) NULL,
    [Style] [nchar](2) NULL,
    [ProductSubcategoryID] [int] NULL,
    [ProductModelID] [int] NULL,
    [SellStartDate] [datetime] NOT NULL,
    [SellEndDate] [datetime] NULL,
    [DiscontinuedDate] [datetime] NULL,
    CONSTRAINT [PK_Product_ProductID] PRIMARY KEY CLUSTERED
    (
        [ProductID] ASC
    )
)

CREATE TABLE [SalesOrderHeader](
    [SalesOrderID] [int] IDENTITY(1,1) NOT NULL,
    [OrderDate] [datetime] NOT NULL ,
    [SalesOrderNumber] [nvarchar](25) NOT NULL,
    [CustomerID] [int] NOT NULL,
    [SalesPersonID] [int] NULL,
    [TerritoryID] [int] NULL,
    [BillToAddressID] [int] NOT NULL,
    [ShipToAddressID] [int] NOT NULL,
    [ShipMethodID] [int] NOT NULL,
    [CreditCardID] [int] NULL,
    [CreditCardApprovalCode] [varchar](15) NULL,
    [CurrencyRateID] [int] NULL,
    [SubTotal] [money] NOT NULL,
    [TaxAmt] [money] NOT NULL ,
    [Freight] [money] NOT NULL ,
    [TotalDue]  AS (isnull(([SubTotal]+[TaxAmt])+[Freight],(0))),
    [Comment] [nvarchar](128) NULL,
    CONSTRAINT [PK_SalesOrderHeader_SalesOrderID] PRIMARY KEY CLUSTERED
    (
        [SalesOrderID] ASC
    )
)

CREATE TABLE [SalesOrderDetail](
    [SalesOrderID] [int] NOT NULL,
    [SalesOrderDetailID] [int] IDENTITY(1,1) NOT NULL,
    [CarrierTrackingNumber] [nvarchar](25) NULL,
    [OrderQty] [smallint] NOT NULL,
    [ProductID] [int] NOT NULL,
    [SpecialOfferID] [int] NOT NULL,
    [UnitPrice] [money] NOT NULL,
    [UnitPriceDiscount] [money],
    [LineTotal]  AS (isnull(([UnitPrice]*((1.0)-[UnitPriceDiscount]))*[OrderQty],(0.0))),
     CONSTRAINT [PK_SalesOrderDetail_SalesOrderID_SalesOrderDetailID] PRIMARY KEY CLUSTERED
    (
        [SalesOrderID] ASC,
        [SalesOrderDetailID] ASC
    )
)

From the above tables, the hubs, links, and satellites are as below:

  • Hubs (HProduct, HProductSubCategory, HProductCategory)
  • Links (LProducts, LOrderHeader, LOrderDetail)
  • Satellites (SHProduct, SHProductSubCategory, SHProductCategory, SLOrderHeader, SLOrderDetail)

Create Table [HProductCategory]
(
    HProductCategoryKey int identity(1,1) NOT NULL constraint PK_HProductCategory primary key clustered,
    ProductCategoryID int not null,
    LoadDate datetime not null,
    RecordSource varchar(100)
)
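
To make the loading pattern concrete, here is a minimal hub load sketch. This is my own assumption of how it could be done, not something from the referenced articles; the GETDATE() load date and the 'AdventureWorks' record source literal are illustrative.

-- Hub load sketch: insert business keys that are not yet present in the hub.
INSERT INTO [HProductCategory] (ProductCategoryID, LoadDate, RecordSource)
SELECT src.ProductCategoryID, GETDATE(), 'AdventureWorks'
FROM [ProductCategory] src
WHERE NOT EXISTS (SELECT 1
                  FROM [HProductCategory] h
                  WHERE h.ProductCategoryID = src.ProductCategoryID)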

Create Table [SHProductCategory]
(
    HProductCategoryKey int not null,
    LoadDate datetime not null,
    [Name] nvarchar(100) not null,
    constraint PK_SHProductCategory Primary Key Clustered
    (
        HProductCategoryKey,
        LoadDate
    )
)
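
Similarly, a minimal satellite load sketch (again my own assumption, following the same pattern): a new row is inserted only when no satellite row exists yet for the hub key, or the latest row's Name differs from the source value.

-- Satellite load sketch: insert a new version only when the value changed.
INSERT INTO [SHProductCategory] (HProductCategoryKey, LoadDate, [Name])
SELECT h.HProductCategoryKey, GETDATE(), src.[Name]
FROM [ProductCategory] src
JOIN [HProductCategory] h ON h.ProductCategoryID = src.ProductCategoryID
WHERE NOT EXISTS (
    SELECT 1 FROM [SHProductCategory] s
    WHERE s.HProductCategoryKey = h.HProductCategoryKey
      AND s.[Name] = src.[Name]
      AND s.LoadDate = (SELECT MAX(s2.LoadDate) FROM [SHProductCategory] s2
                        WHERE s2.HProductCategoryKey = s.HProductCategoryKey))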

Create Table HProductSubCategory(
    HProductSubCategoryKey int identity(1,1) NOT NULL constraint PK_HProductSubCategory primary key clustered,
    ProductSubCategoryID int NOT NULL,
    LoadDate datetime not null,
    RecordSource varchar(100)
)

Create Table SHProductSubCategory(
    HProductSubCategoryKey int not null,
    LoadDate datetime not null,
    Name varchar(100),
    constraint [PK_SHProductSubCategory] primary key clustered
    (
        HProductSubCategoryKey,
        LoadDate
    )
)

Create table HProduct
(
    HProductKey int identity(1,1) not null constraint PK_HProductKey primary key clustered,
    [ProductID] [int] NOT NULL,
    LoadDate datetime not null,
    RecordSource varchar(100)

)

Create table SHProduct
(
    HProductKey int not null,
    LoadDate datetime not null,
    [Name] nvarchar(100) NOT NULL,
    [ProductNumber] [nvarchar](25) NOT NULL,
    [Color] [nvarchar](15) NULL,
    [SafetyStockLevel] [smallint] NOT NULL,
    [ReorderPoint] [smallint] NOT NULL,
    [StandardCost] [money] NOT NULL,
    [ListPrice] [money] NOT NULL,
    [Size] [nvarchar](5) NULL,
    [SizeUnitMeasureCode] [nchar](3) NULL,
    [WeightUnitMeasureCode] [nchar](3) NULL,
    [Weight] [decimal](8, 2) NULL,
    [DaysToManufacture] [int] NOT NULL,
    [ProductLine] [nchar](2) NULL,
    [Class] [nchar](2) NULL,
    [Style] [nchar](2) NULL,
     [SellStartDate] [datetime] NOT NULL,
    [SellEndDate] [datetime] NULL,
    [DiscontinuedDate] [datetime] NULL,
    Constraint [PK_SHProductKey] Primary Key Clustered
    (
        HProductKey,
        LoadDate
    )
)


Create Table LProducts
(
    LProductsKey int identity(1,1) not null constraint PK_LProducts Primary key Clustered,
    HProductKey int not null,
    HProductSubCategoryKey int not null,
    HProductCategorykey int not null,
    LoadDate datetime not null,
    RecordSource varchar(100) not null
)
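
Below is a minimal sketch (my own assumption, not from the articles) of loading this link: the hub surrogate keys are resolved by joining back to the hubs on the business keys, and only previously unseen key combinations are inserted. Products without a subcategory are skipped by the inner joins.

-- Link load sketch: resolve hub surrogate keys via the business keys.
INSERT INTO [LProducts]
    (HProductKey, HProductSubCategoryKey, HProductCategoryKey, LoadDate, RecordSource)
SELECT hp.HProductKey, hsc.HProductSubCategoryKey, hc.HProductCategoryKey,
       GETDATE(), 'AdventureWorks'
FROM [Product] p
JOIN [HProduct] hp ON hp.ProductID = p.ProductID
JOIN [ProductSubcategory] sc ON sc.ProductSubcategoryID = p.ProductSubcategoryID
JOIN [HProductSubCategory] hsc ON hsc.ProductSubCategoryID = sc.ProductSubcategoryID
JOIN [HProductCategory] hc ON hc.ProductCategoryID = sc.ProductCategoryID
WHERE NOT EXISTS (SELECT 1 FROM [LProducts] l
                  WHERE l.HProductKey = hp.HProductKey
                    AND l.HProductSubCategoryKey = hsc.HProductSubCategoryKey
                    AND l.HProductCategoryKey = hc.HProductCategoryKey)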

Create table LOrderHeader
(
    LOrderHeaderKey int identity(1,1) not null constraint pk_LOrderHeaderKey Primary key Clustered,
    [SalesOrderID] int not null,
    [CustomerID] [int] NOT NULL,
    [SalesPersonID] [int] NULL,
    [TerritoryID] [int] NULL,
    [BillToAddressID] [int] NOT NULL,
    [ShipToAddressID] [int] NOT NULL,
    [ShipMethodID] [int] NOT NULL,
    [CreditCardID] [int] NULL,
    [CurrencyRateID] [int] NULL
)

Create table SLOrderHeader
(
    LOrderHeaderKey int not null,
    LoadDate datetime not null,
    [OrderDate] [datetime] NOT NULL,
    [SalesOrderNumber] [nvarchar](25) NOT NULL,
    [CreditCardApprovalCode] [varchar](15) NULL,
    [SubTotal] [money] NOT NULL,
    [TaxAmt] [money] NOT NULL,
    [Freight] [money] NOT NULL,
    [TotalDue]  AS (isnull(([SubTotal]+[TaxAmt])+[Freight],(0))),
    [Comment] [nvarchar](128) NULL,
    Constraint [PK_SLOrderHeader] Primary Key Clustered
    (
        LOrderHeaderKey,
        LoadDate
    )
)

Create table LOrderDetail
(
    LOrderDetailKey int identity(1,1) constraint PK_OrderDetailKey primary key clustered,
    LOrderHeaderKey int not null,
    [SalesOrderDetailID] [int]  NOT NULL,
    LProductsKey int not null,
    [SpecialOfferID] [int] NOT NULL
)

Create table SLOrderDetail
(
    LOrderDetailKey int not null,
    LoadDate datetime not null,
    [CarrierTrackingNumber] [nvarchar](25) NULL,
    [OrderQty] [smallint] NOT NULL,
    [UnitPrice] [money] NOT NULL,
    [UnitPriceDiscount] [money] NULL,
    [LineTotal]  AS (isnull(([UnitPrice]*((1.0)-[UnitPriceDiscount]))*[OrderQty],(0.0))),
    Constraint [PK_SLOrderDetail] Primary Key Clustered
    (
        LOrderDetailKey,
        LoadDate
    )
)
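
To close the loop on the "views for data marts" idea from the start of this post, here is a rough sketch of a virtual DimProduct built as a view over the hub, the latest satellite rows, and the LProducts link. DimProduct and the column aliases are hypothetical names of mine, and the sketch assumes a single current LProducts row per product.

-- Virtual dimension sketch: flatten hub + latest satellites into DimProduct.
CREATE VIEW DimProduct AS
SELECT hp.HProductKey AS ProductKey,
       hp.ProductID,
       sp.[Name]      AS ProductName,
       sp.ListPrice,
       ssc.Name       AS SubcategoryName,
       sc.[Name]      AS CategoryName
FROM HProduct hp
JOIN SHProduct sp
  ON sp.HProductKey = hp.HProductKey
 AND sp.LoadDate = (SELECT MAX(x.LoadDate) FROM SHProduct x
                    WHERE x.HProductKey = hp.HProductKey)
JOIN LProducts l ON l.HProductKey = hp.HProductKey
JOIN SHProductSubCategory ssc
  ON ssc.HProductSubCategoryKey = l.HProductSubCategoryKey
 AND ssc.LoadDate = (SELECT MAX(x.LoadDate) FROM SHProductSubCategory x
                     WHERE x.HProductSubCategoryKey = l.HProductSubCategoryKey)
JOIN SHProductCategory sc
  ON sc.HProductCategoryKey = l.HProductCategoryKey
 AND sc.LoadDate = (SELECT MAX(x.LoadDate) FROM SHProductCategory x
                    WHERE x.HProductCategoryKey = l.HProductCategoryKey)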

This completes my first attempt at designing a DV DWH model from the 3NF of OLTP systems. During the design a few questions cropped up which are ignored for now; the intention is to address / understand them in the next couple of posts. A few of the questions are below:

  • If there is a parent-child relationship between tables, in DV it can be flattened in two ways:
    • include the parent attributes in the hubs, or
    • create a link table that captures the relationship between parent and child.
      • In the latter case, do we need to create an additional link table, or will the link table created for the transactions suffice?
        • What are the pros and cons of a new link table vs. leveraging an existing one?
  • When an attribute is present in a transaction table:
    • What decides whether the attribute goes into the link or into a satellite table?

A few such questions remain to be pondered and learnt. I also need to present the above SQL scripts in a more presentable, visual format.

Until my next learning

Abhyast


“Day 3: Learn 3 Things A Day – Learning 1”

What do I want from DWH architecture?

Reading about Data Vault modeling led me to think about what we require from DWH modeling / architecture. These requirements may or may not be practical, but nevertheless I would like to understand for myself what exactly is needed from a data warehouse architecture. These are also non-functional requirements that are generic and can be applied to all verticals.

Prior to detailing the requirements, let me define the problem space:

  • The DWH should be the place where enterprise data is consolidated.
  • Data is generated by LOB applications.
  • Existing applications are modified to meet new business changes and requirements.
  • A few applications are retired, and at the same time new applications are introduced.
  • Business partners have different application landscapes that organizations need to integrate.
  • With new technologies the rate of change has increased, and the modeling should handle it.
  • All of the above can be summarized as: a new data warehouse model should handle
    • Volume (V): a high volume of data
    • Variety (V): data from disparate data sources (both internal and external applications)
    • Velocity (V): a high rate of data flow
  • In addition, a data warehouse architecture should be able to absorb changes to the structure (format) of source data at a high rate.

Requirements can be split into two categories.

  1. Modeling requirements
  2. Data movement requirements

Modeling Requirements:

  • Encompassing & integrated:
    • All possible sources of data should be incorporated
    • Include attributes for every table from every LOB applications
    • Data from every source application should be merged (either as-is or de-duplicated)
    • Integrate data from third party applications both internal and external to organization
  • Flexible:
    • DWH model should withstand application changes
      • Applications may have features enabled or disabled (with flags) resulting in different data generated.
      • Existing features may be modified or new features added again resulting changes to data format
      • Applications may be retired resulting in data not generated by application but data collected till data need to be still retained by DWH
  • Extendable: DWH model should be extendable and new entities may be added
    • New application may be introduced that will bring
      • New masters (contexts to slice transactions in new ways)
      • New transactions
    • All these should be used on conjunction with existing data.
  • Simplistic & Performant:
    • Should be simple for end users to perform adhoc analysis and hypothetical testing
    • Should be highly performant (that promotes adhoc analysis)
  • Traceability:
    • Data flow from source should be traced and audited

Data movement requirements:

Data from source systems needs to be moved to the DWH and loaded into a set of tables in the DWH model.

  • Considering the volume, velocity, and variety of data generated by source systems, the data flow architecture should be scalable, preferably horizontally (scale-out rather than scale-up).
  • Additionally, for a few data sets (like comments) there could be a need to aggregate data during data movement, before persisting it in the DWH model.
  • Development of data movement packages should go in tandem with modeling changes. Similar to the modeling requirement, package development should follow an append-only mode, with new packages added instead of existing packages modified.

DU = Deployable Unit

A deployable unit should include

  • Model changes (DB Scripts)
  • Packages for movement of data

In summary, the architecture should have high performance and enable deployment of independent DUs (deployable units of BI components), which can be turned on or off as required, with built-in features like traceability, auditing, and history.

Admittedly, some of these requirements may be idealistic and not practical, but I would like to ideate on this further, especially the "Deployable Unit" in a DWH.

“Learn 3 things a Day – Learning 1”

I have tried so many ways to kick the lethargy out of me and force myself to blog. This is one more thought that I had: every day I will think about / learn / understand three new things and blog about them. Let's see if I can make this initiative into a habit.

Data Vault (DV) modeling:

Start with Disclaimer 🙂 :

Over the last week (inclusive of today), I have read and tried to understand "Data Vault" modeling, though admittedly I have not experienced it enough to comment. I have always believed that learning / understanding has multiple degrees, ranging from "Read –> Understand –> Explain –> Experience", with "Read" and "Experience" being the lowest and highest levels of learning. With regard to DV modeling, I DO NOT have enough implementation experience to give a fair comparison and comment. Treat what follows as my learning, not as any passionate war :).

Objective of Data Vault (DV) modeling: Based on readings from multiple blogs,

  • DV modeling tries to retain data in its "as-is" form from the source systems, avoiding costly and time-consuming conversions when loading data into DWH data models.
    • Loading into a star schema includes format (structure) conversions and data quality checks that delay the entire process (both the design of the data models and the loading of data into the dimensional model).
  • Detail oriented: it keeps the same grain as the source systems, thus avoiding "on the fly" aggregation in the data model.
    • On systems with a high velocity of data flow (like telecom or weblogs), the general practice (though optional) is to aggregate data to higher levels to aid reporting (but this loses the detailed data permanently).
  • Agile (both in modeling and loading).
    • Minimal transformation of the data format from the source.
    • Agile BI vs. waterfall BI.
  • Ability to audit and track changes.

 Basics of DV modeling: Data Vault modeling consists of

  • Hub: Business Keys + Surrogate Keys + Source Systems & Timestamps
  • Links: Capture relationships between Hubs
  • Satellites: Capture context of Hubs and Links.

Comparing with the Dimensional Model:

Dimension Model | Data Vault Model                          | Remarks
----------------|-------------------------------------------|--------
Dimension       | Hub + Satellite                           |
Fact            | Link + Satellite + other Links if needed  |

The DV model sits between the 3NF of the source systems and the star schemas (2NF dimensions and 3NF facts) of the dimensional model. IMHO, the transformation from source systems (3NF) to the DV model should be easier, as there will be minimal lookups, transformations, and conversions. But the DV model cannot be exposed directly to consumers of the DWH service; instead, a dimensional model (physical or virtual) needs to be built, physical meaning converting the DV into a dimensional model, and virtual implying creating views on top of the DV model. The recommendation is to create a virtual data model (using views), as this aligns with the agile development goal of DV modeling. So if the dimensional model is the model for reporting, the DV model is the middle-layer model consuming data from a backend model that is closer to 3NF.

As an aside, Microsoft SQL Server Analysis Services internally maintains KStores, RStores, and HStores. The KStore is analogous to a hub and the RStore is analogous to a link, but in SSAS these are maintained for the attributes within a dimension, whereas in DV they work across dimensions.

At a high level, this is what I understood about how DV stands with respect to other DWH models. I am still in the midst of reading, understanding, and internalizing various articles.


  • Some questions that are still lingering:
    • How does DV deal with SCD Type 2 and Type 3 dimensions?
    • How does DV modeling deal with different fact types (transactional, snapshot, etc.)?
    • Is it agile in the true sense, and can the model be developed iteratively?
    • Is the Data Vault model suited to new-age companies with MPP and Hadoop architectures?
    • What are the pitfalls of DV models that can be envisaged and avoided?

Need to learn and think more… moving on to the 2nd learning / thought of the day.

Happy Learning………….

–Abhyast