The sum of all SQL Server matter that exist,
and the space in which all SQL Server events
occur or could occur.
Welcome to SQL Server Universe.com Sign in | Join | Help
Home SS SLUG Forums Articles Photos Downloads

Populating Dimension Tables with the SCD Task by Gogula Aryalingam


  

[ RATE this ARTICLE, and COMMENT on it ]

While building ETLs for a data warehouse or a data mart; one of the most interesting yet time consuming ETLs would be the design and implementation of Dimension ETLs. Whilst building these dimensions using SQL Server 2000 DTS, though a lot of thought and algorithms may have had to go into them, the fact remained that the transformation scripts that I had to write tended to get monotonous and yes, a little too lengthy. Integration Services from SQL Server 2005 has made this a lot simpler.

In the projects that I have worked on, almost always it would be Slowly Changing Dimensions (SCDs) that would be numerous. It is this, which I wish to take into consideration, and demonstrate how I do the ETLing of SCDs on SQL Server 2005 Integration Services.

Prologue

The term Slowly Changing Dimension was not found in SQL Server 2000 though it was theoretically available in data warehousing terminology, nor was the term Rapidly Changing Dimension. In SQL Server 2000, dimensions were considered as ‘just' dimensions. But in SQL Server 2005, the concept of dimensions has been given a slightly better understanding and is divided into three sections: Regular Dimensions, Slowly Changing Dimensions and Rapidly Changing Dimensions. Though they are basically the same, the way they are implemented varies. Referring Books Online would give you a simple and easy read where you would be able to get the picture quite clearly.

Slowly Changing Dimensions

Slowly Changing Dimensions (SCDs) are dimensions which would have their attributes and/or hierarchy changing over time. There are three types of SCDs, simply known as Type I, Type II, and Type III. A Type I SCD is similar to a regular dimension; where historical changes are not maintained; data changes are overwritten. A Type II SCD requires that historical changes are kept track of. This is done by adding a new record every time a historical attribute changes. Type III SCDs are similar to Type II SCDs except that, only the original record and the last record are kept. Records that contain changes in between are discarded.

The Scenario

Let us take a little scenario to aid us visualize the whole thing. We shall use the business scenario of the mythical Malabar Pharmaceuticals who distribute their pharmaceuticals to different parts of the country. The distribution is done through Sales Representatives who serve in different regions. These sales reps are occasionally transferred to other regions. Therefore the management of Malabar would like to see the performance of these sales reps based on the Regions they have worked in and are currently working in. This gives us a classic example of a Type II SCD. See the table below on how the dimension table DimSalesPerson is designed:

ID

SalesRepID

Name

Region

StartDate

EndDate

...

1

SR001

Amy Lee

North

1/1/2005 11:45:47 AM

NULL

...

2

SR010

Gwen Stephen

North

3/1/2006 12:50:32 AM

5/30/2006 05:45:12 PM

...

3

SR010

Gwen Stephen

Central

5/30/2006 05:45:12 PM

NULL

...

4

SR013

Brett Wilkinson

South

3/1/2006 12:50:32 AM

7/31/2007 12:00:33 PM

...

5

SR200

Janice Jameson

East

9/25/2006 12:48:47 AM

NULL

...

6

SR013

Brett Wilkinson

West

7/31/2007 12:00:33 PM

NULL

...

 

In the above table example, ID is the surrogate key (which would later link to the fact table). SalesRepID is the sales reps' unique identifier whilst Name and Region are self explanatory. The StartDate and EndDate fields are used for keeping track of historical changes. They indicate the period when the sales rep worked in a particular region. The records containing a NULL EndDate indicate the current current (or valid) record.

The OLTP database (the source of all the data) contains two tables among others which store the Region and SalesRep information. They are as following:

Regions Sales Persons

RegionID (PK)

RegionName

 

SalesRepresentativeID (PK)

SalesRepresentativeName

RegionID (FK)

1

North

SR001

Amy Lee

1

2

South

SR010

Gwen Stephen

3

3

Central

SR013

Brett Wilkinson

1

4

East

SR168

Michael Harper

4

5

West

SR200

Janice Jameson

5

So, what we have to do now is build an ETL to populate DimSalesRep with data from Region and SalesRep tables, also keeping in mind that historical changes are also to be maintained.

If I had to do this transformation using DTS, I would need to write a lot of ActiveX codes to: check if the data to be inserted already exists in the dimension table; and if so see if it is different and update it. If the record does not exist, then go ahead and insert it. But that alone would not suffice; I may have to lookup related tables to get in values for referential integrity.

So then what does SSIS offer us? Move ahead and find out for yourself...

  1. Create a new Integration Services package in Business Intelligence Management Studio.
    1. Click on FileNewProject....
    2. Under Business Intelligence Projects project type select Integration Services Project.
    3. Type a name for the project in the Name field, select a location to save the project in, and click on OK.
    4. This will create a new project and open the default package Package1.dtsx.
  2. Create a Connection Manager for the package to connect to the OLTP database.
    1. Right click in the Connection Manager section, and select New OLE DB Connection....
    2. Click on New.
    3. Enter credentials to login to the database server and select DimTransDemo from the database list under Select or enter a database name. Test the connection and click on OK.
    4. Click on OK again in the Configure OLE DB Connection Manager dialog.
    5. Rename the created connection manager to a more user-friendly name, such as OLTP_DB.
  3. Using the same method, create a Connection Manager that connects to the DimTransDemoDW database (which will act as the data warehouse); name this connection manager as Warehouse.
  4. Create the data transformation
    1. From the toolbox, drag and drop a Data Flow Task onto the Control Flow section. Name it as SalesRepTransfer.
    2. Double-click on SalesRepTransfer to go into the Data Flow tab.
  5. Create and configure the data flow source
    OLE DB Data Source
    1. In the Data Flow tab, drag and drop an OLE DB Source from the Data Flow Sources section of the toolbox, and rename it as OLTPSalesReps.
    2. Double-click on this OLE DB Source to get to its properties.
    3. In the OLE DB connection manager dropdown, make sure that OLTP_DB is selected.
    4. In the Data access mode dropdown, select SQL command, since we will write the query that will serve as the source of the dataflow.
    5. In the SQL command text window, type in the following query:
      SELECT
      SR.[SalesRepresentativeID],
      SR.[SalesRepresentativeName],
      R.[RegionName]
      FROM
      dbo.Region R
      INNER JOIN dbo.SalesRep SR ON R.[RegionID] = SR.[RegionID]
      This query joins data from the previously introduced two tables on the OLTP database. It will be the source for our transformation.
    6. Click on Parse Query to check for errors in the syntax, and the click on Preview to get a preview of the data returned by the query. Click on OK.
  6. Create and configure the transformation task
    OLE DB Data Source and SCD Task
    1. Drag and drop a Slowly Changing Dimension task onto the designer and rename is as SalesRepTransformation.
    2. Drag the success (green) precedence constraint from OLTPSalesReps (the OLE DB Source) and drop onto SalesRepTransformation.
    3. Double-click on SalesRepTransformation to initiate the Slowly Changing Dimension Wizard, so that we could configure the transformation.
    4. On the welcome page of the wizard click on Next.
    5. In the Select a Dimension Table and Keys screen; select the destination connection manager (Warehouse) from the Connection manager dropdown.
    6. In the Table or view dropdown select the destination table (DimSalesRep). This would load the fields of that table in the Dimension Columns field of the grid below it.
    7. In the Input Columns field select the appropriate input columns which have values for the dimension columns. Select SalesRepresentativeID as the Business key.
      Input Columns Selection screen
      Note: a business key is the field which identifies the record.
      Observe that EndDate and StartDate do not have any input values. They will be populated by the transformation depending on the freshness of the record. Now click on Next.
    8. In the Slowly Changing Dimension columns screen, select the type of change for the columns. Since the Name field can change (E.g. due to a spelling correction) let us select the type as Changing attribute, but since the Region data has to be historically maintained (i.e. a new record every time a sales rep changes locations) let us select the type as Historical attribute. Click on Next.
      Slowly Changing Dimension columns
    9. In the Fixed and Changing Attribute Options page check the box under Changing attributes. Click on Next.
    10. You now come to the Historical Attributes Options screen. Here we could select one of two mechanisms to record your history of the dimension table.
      1. The first option requires you to have a Boolean field in the dimension table which would indicate if the record is current or not.
      2. The second option requires you to have two fields in the dimension table that would record the starting date and ending date of the validity of the record.
    11. Since we have the StartDate and EndDate fields, let us choose the second option button on the screen.
    12. Specify the appropriate values for Start date column and End date column, and select System::StartTime under the Variable to set date values dropdown. This would use the Start time of the package as the date value to be recorded. Click on Next.
      Historical Attributes Options screen
    13. In the Inferred Dimension Members screen, uncheck the checkbox and click on Next. Click on Finish on the last screen. The wizard automatically does the rest!
      Final Output
      Notice that there are two destinations created: one for updates and another for inserts. I shall leave you with the pleasure of exploring the rest of the tasks that the wizard has laid down.
  7. Finally, let us test the transformation.
    1. Run the package, all Sales representatives would be inserted into the dimension table. Note that the StartDate field now has a value, but EndDate does not. That is because all the records are valid from the start date but have no end date yet.
    2. Now, change the RegionID of any one Sales Rep at the OLTP side, and rerun the package. You will notice that a new record is inserted for that particular Sales Rep for a new region. The previous record of the same Sales Rep will be updated with an end date.
    3. Lastly, Change the name of a Sales Rep at the OLTP side, and rerun the package. You will notice that the Sales Rep's name get updated in the dimension table, but no new record is added.

Conclusion

The Slowly Changing Dimension transformation task gives you the freedom to focus on other parts of your ETL, since once the configuration is done using the wizard, most of the things that you had to do using extensive, time-consuming ActiveX scripting in DTS, takes just a few clicks and minutes.

[ RATE this ARTICLE, and COMMENT on it ]

Powered by Community Server, by Telligent Systems