display | more...
One of the few poetically named concepts I have come across while working with huge blobs of data.

The concept of slowly changing dimensions, or SCD, was introduced by Ralph Kimball in his work on data warehousing.
It addresses the problem of data categories, or dimensions, being updated over time. It is strongly related to dimensional modeling and star schemas (star schemas being a database design solution based on the dimensional modeling pattern).

Understanding what a slowly changing dimension is and how it works may require a bit of background, so here's an example, rather lengthy I am afraid, of the kind of reasoning that leads to the concept's definition.
Let's start by considering a possible solution for storing a shop's credit card transaction data. We assume that the storage engine is some sort of relational database, and we choose to store everything in a single table, which we represent thus with the field names in the first row and example transaction values in the following rows:

                                 

+---------------------+-------------------+--------------------+--------------+------------------+--------------------+---------+
| timestamp           |credit card number |owner name          |owner zip code|transaction amount| item name          |item type|
+---------------------+-------------------+--------------------+--------------+------------------+--------------------+---------+
| 1971-01-09 20:32:00 | 45123123123123    | Mickey Mouse       | 101 64       | 42.50            | egg nog blender    | bar tool|
| 1971-01-09 20:32:00 | 33444222333233    | August Mittagessen | 621 01       | 99.00            | tio pepe dry sherry| booze   |
 ...
model 1

Such a table allows for interesting purchase patterns reports to be created. For example we could use the zip code field together with the item type to, say, target a geographic area where a class of products is selling poorly with tailored advertisement.

Another possible way to use the table could be to track the purchase amounts by geographic area over time, maybe in order to follow the purchase patten development in a shop's neighborhood.

There are a few problems with this schema though. A pressing one is that transaction tables tend to contain a huge number of records. Actually their size tends to dwarf anything else in a typical database. A way to minimize the issue could be to replace fields whose values tend to repeat themselves with shorter placeholders. In our example we can see that a probably limited number of customers, using a likewise limited number of credit cards, chooses to purchase from a limited collection of items a possibly _unlimited_ number of times. A solution in this case would be to create a new table for the card-related fields, and another for the item-related fields. The results would be something like this:

card table
+---------+-------------------+--------------------+---------------+
|card_fk  |credit card number |owner name          |owner zip code |
+---------+-------------------+--------------------+---------------+
| 1       |45123123123123     |Mickey Mouse        | 101 64        |
| 2       |33444222333233     |August Mittagessen  | 621 01        |
 ...

item table
+--------+--------------------+----------+
|item_fk |item name           |item type |
+--------+--------------------+----------+
| 1      |egg nog blender     |bar tool  |
| 2      |tio pepe dry sherry |booze     |
...

transaction table
+---------------------+--------+--------+------------------+
| timestamp           |card_fk |item_fk |transaction amount|
+---------------------+-----------------+------------------+
| 1971-01-09 20:32:00 | 1      | 1      | 42.50            |
| 1971-01-09 20:32:00 | 2      | 2      | 99.00            |
 ...

model 2

Another representation of the same model may shed some light on the reason why such a schema is called a 'star':

+------------------+                  +-------------+
|card              |                  |item         |
+------------------+                  +-------------+
|card_fk           |-|----+    +----|-|item_fk      |
+------------------+      |    |      +-------------+ 
|credit card number|     /|\  /|\     |item name    |
|owner name        | +--------------+ |item type    | 
|owner zip code    | |transaction   | +-------------+
+------------------+ +--------------+
                     |customer_fk   |
                     |item_fk       |
                     +--------------+
                     |timestamp     |
                     |amount        |
                     +--------------+

As the number of dimensions connected to our transaction table increases, the model start looking like a star of sorts, or at least so I am told. But I digress.

In 'model 2' above I replaced the card and item related fields in the transaction table with placeholders, numbers that uniquely identify records in the card and item tables respectively. Such placeholders are called 'foreign keys', hence the field names. Such groupings of related category fields are called data 'dimensions'.
Besides the space savings, this model has other advantages. Adding owner details in the first model, like the card owner's marital status for instance, would have meant to update each transaction record, probably a time-consuming and risky procedure (remember: transaction tables are _huge_). In model 2, such changes do not require touching the transaction table at all.

Updating dimension tables does have its quirks, though. Let's say that we were notified that a card owner moved to a new address and we wanted to update the zip code of her record. We could simply update the record, which means that all of the customer's transactions, past and future, would be connected to the new zip number. This is called a slowly changing dimension of type 1 (finally! :^) ).
Of course, if we only are interested in the customers' current residence this is perfectly fine. If we mean to use the database to, say, track transaction counts by geographic area over time though, our results will be skewed.

An alternative way to handle the situation is to create a new card owner record, a copy of the original excluding the zip code, which is given the new value. The new record gets a new, unique foreign key, and new transaction records are connected to the card table via the new foreign key. Lo! Our query is not skewed any longer.
This is called a slowly changing dimension of type 2.

A third way to handle our customer moving around is to create an alternative view of her record. For example we could design the card table something like this way:

card table
+---------+-------------------+--------------------+-----------------+------------------+
|card_fk  |credit card number |owner name          |current zip code |previous zip code |
+---------+-------------------+--------------------+-----------------+------------------+
| 1       |45123123123123     |Mickey Mouse        | 101 64          | 101 64           |
| 2       |33444222333233     |August Mittagessen  | 621 01          | 696 69           |
 ...


This is a slowly changing dimension of type 3.
By doing this we may ask our database questions like 'where do we find people that buy ungodly quantities of booze, where do they live _right now_?'. Knowing the time for the latest zip code update would also allow us to see how the answer to this question has changed over time.

There is of course more to SCDs, but this explanation is meant to be an introduction to the subject for people that are not actually involved in database design, and I will leave the more excruciatingly boring stuff out.
My hope is that these lines will be used to survive a conversation with somebody who actually works with this, and whom you, Oh reader, are interested in getting to know better. :^)

Log in or register to write something here or to contact authors.