Fervent Coder

Coding Towards Utopia...by Rob Reynolds
posts - 278 , comments - 431 , trackbacks - 0

My Links

News


Rob Reynolds

Subscribe to Fervent Coder RSS
Subscribe to Fervent Coder by Email

About Me

I manage several open source projects. Need...
   ...package management for Windows?
   ...automated builds?
   ...database change management (migrations)?
   ...your application to check email?
   ...a monitoring utility?

I also write for



Like what you are reading? Want to buy me a cup of coffee?
PayPal - The safer, easier way to pay online!

Archives

Sites

Data Warehouses Should Be Granularity Agnostic

When you are designing a data warehouse solution that publishes out data marts, always be sure that your warehouse is granularity agnostic.  Granularity means the level of detail you track (defined more in depth below). Being agnostic means that you don't care about or have knowledge of something. To be "granularity agnostic" means that your warehouse doesn't care or have knowledge of how granular your data is. In essence, you design it up front so that it can get as much or as little detail as possible.

For example, your warehouse could start out housing daily items.  At some point, you may have a need to trickle feed your warehouse (per events, possibly multiple times per day).  If your warehouse is granularity agnostic, you do not have to change your database design.  To do that, we borrow the concept of batching, which moves the level of detail away from time or any other element of granularity.  You can batch as much or as little as your want. That means we can pull in as much or as little detail as we need without making changes to our database design.  And it's so simple it doesn't really require big up front design.

This is how Wikipedia defines granularity of data:

The granularity of data refers to the fineness with which data fields are sub-divided. For example, a postal address can be recorded, with low granularity, as one field:

  1. address = 200 2nd Ave. South #358, St. Petersburg, FL 33701-4313 USA

or with high granularity, as many fields:

  1. street address = 200 2nd Ave. South #358
  2. city = St. Petersburg
  3. postal code = FL 33701-4313
  4. country = USA

or even higher granularity:

  1. street number = 200
  2. street = 2nd Ave. South #358
  3. city = St. Petersburg
  4. postal code state = FL
  5. postal-code-first-part = 33701
  6. postal-code-second-part = 4313
  7. country = USA

Higher granularity has overheads for data input and storage, but offers benefits in flexibility of data processing.

To change that a little, we think of granularity within the concept of Data Warehousing to mean how finely you track events/changes.  For example, if you think of investing, a trade could go through several stages during the day.  At the start of a day a trade could be set up. Then it gets approved.  Then it might get settled.  All before the close of the day.  If your grain is on a daily basis, you will only see the trade as it exists at the end of the day.  The day before you don't see anything with this trade and the day of you only see the trade in a settled status.  If instead you are trickle feeding, you see every one of these events in your warehouse. 

Now think in terms of making changes to how often you feed the data to the warehouse.  If you want to change it from daily to hourly, the warehouse design should not have to change at all.  If you want to move from daily to a trickle feed, it should handle that appropriately.  One way it is achieved is by using batching and surrogate keys. That is a data warehouse that is granularity agnostic. 

Print | posted on Friday, February 22, 2008 7:28 AM | Filed Under [ DataWarehousing ]

Feedback

Gravatar

# re: Data Warehouses Should Be Granularity Agnostic

It was brought to my attention that my wikipedia example is confusing to the rest of the article. It is to bring you into a frame of reference with one definition of granularity. I did put below it that I am contrasting that definition with a different definition of granularity as it relates to warehousing.

"To change that a little, we think of granularity within the concept of Data Warehousing to mean how finely you track events/changes."
2/22/2008 9:47 AM | Robz
Gravatar

# re: Data Warehouses Should Be Granularity Agnostic

A great way to insure that this process happens is to not actually "do" batches. By having your data warehouse consume events from the beginning you can be sure it can consume events later.

A batch is nothing but a series of events.

Why not have the other side just send a series of events for the batch then you can later decide to produce them in a trickle/hourly manner.

It becomes less a concern ... you are talking about changing the SLA of the message ... nothing more :-)
4/26/2008 5:06 PM | Greg Young
Gravatar

# re: Data Warehouses Should Be Granularity Agnostic

@Greg: Exactly! That is where I was going with this. The idea of getting items on an hourly or daily basis should not have any affect on your warehouse when you are granularity agnostic. If you get things every minute, you should just be able to handle them without complaining. :D
4/26/2008 5:34 PM | The Fervent Coder
Gravatar

# re: Data Warehouses Should Be Granularity Agnostic

Cool yeah ...

Another point to make here is that you can achieve this by making your data warehouse changes additive only.
4/26/2008 6:43 PM | Greg Young
Comments have been closed on this topic.

Powered by: