Slowly Changing Dimensions | Data Warehousing
Data warehouses include data from databases as well as operational and external sources. They are traditionally built on relational databases, but can built on multidimensional; you can think of the database as the platform (‘base’) and the warehouse as the usage.
Data warehouses access and copy data from sources using ETL integrations. ETL stands for ‘Extract’, ‘Transform,’ and ‘Load.’ Combining information from all sources allows businesses to make true, holistic data-driven decisions from the one-stop shop that is a data warehouse.
A slowly changing dimension (SCD) allows businesses to capture changes in their data over time. In this blog, I am going to cover Type 1, 2, and 3 SCDs.
Type 1 Slowly Changing Dimension
Type 1 SCD is often called an ‘in-place’ ETL pattern. It is the simplest SCD, but only works for certain use-cases. While it replaces the old value with the new, it does not maintain any history.
Common Use-Case: correcting errors or making changing to data you do not need historic records of.
Example: Let’s say you have a dimension for students that records their name and year, alongside a primary key and identifier.
If you added a new student but misspelled their name, a Type 1 SCD would work perfectly.
There would be no record of the initial recording, but that most likely is not needed as the entry was a clerical error.
Type 2 Slowly Changing Dimension
Type 2 SCD is referenced as the ‘as-is’ pattern. This is because registering a change adds a completely new dimension table row while leaving the existing row as-is. If Type 1 is ‘too hot’ in the Goldilocks metaphor, Type 2 is ‘too cold.’
Ok — that analogy makes no sense but my point is these two are vastly different. Type 2 SCD logs the most robust history, but is the most architecturally complex. As well, you need to be wary of reporting and analytics with this type.
Common Use Case: want to be able to report and run analysis on data even with relevant changes
Example: You want to keep tracking of MLB player’s hitting statistics across their career.
Praying it never happens, but let’s say our lifelong Yankee got traded to another team. In this case, you’d want to hold a record of Brett’s performance on the Yankees and also on his new team.
You can see a few things here. 1) The first record stayed within the database. 2) For the new record, a new key had to be generated but the ID of the player stayed the same. 3) We kept and updated a simple boolean Y or N to indicate whether the record was the active association.
#3 above is extremely important. Type 2 SCD does not automatically maintain the order of new and old versions. When running queries, this flag would allow you to do things like check the averages of all players currently on the Yankees.
Type 3 Slowly Changing Dimension
Type 3 SCD creates multiple versions of history within your dimension tables. Where Type 2 added a new row to row to reflect a change, Type 3 adds a new column.
Common Use Case: reorganization
Example: You want to record what conferences a college football program has been a part of.
However, TCU switched conferences in 2012 — what are we going to do?!
You can add 3–4 columns when new changes occur. However, for any more detail you are better off using a Type 2 SCD. Type 3 SCD’s are not utilized frequently.
There are plenty of other slowly changing dimension options out there; these options are the most common.