data modeling whitepaper

Debating the importance of NULL data in Enterprise Data Warehouse layer

We would like to begin this article with the following NULL value definition retrieved from the Microsoft SQL Server books online site:

NULL indicates that the value is unknown. A null value is different from an empty or zero value. No two null values are equal. Comparisons between two null values, or between a null value and any other value, return unknown because the value of each NULL is unknown.

Null values generally indicate data that is unknown, not applicable, or to be added later. For example, a customer's middle initial may not be known at the time the customer places an order.

But how does the Corepula Method deal with nullable attributes that are not foreign keys? It naturally handles NULL attribute values and allows for their explicit storage in the underlying schema. The reasoning for this is simple. A NULL value cannot be ignored; it has a clear meaning. A NULL value specifies that the attribute's underlying value is unknown. Whatever data modeling solution you use (MDM, RDM, or Enterprise Data Warehouse), keep the following simple rule in mind: data modelers and ETL developers cannot invent data. The job of the Enterprise Data Warehouse layer is to accurately integrate data from various sources and preserve it for the business users' decision making process. If a source data attribute has a NULL value, then the same data attribute must be set to a NULL value in the Enterprise Data layer, unless the modeler is told otherwise by business. For auditing purposes, data in the source system and Enterprise Data Warehouse layers must match to satisfy various regulatory requirements. If data architects begin modifying NULL values and initialize them to some datatype specific pre-set values, various discrepancies will arise between the source and Enterprise Data layers. For example, we've seen Enterprise Data Warehousing schemas where numerically nullable attributes were set to zero (0) whenever the ETL processes encountered numeric source attributes set to NULL. These types of data conversions create problems later on, when the data have to be reconciled against various source systems. In your own practice, once the null-valued numeric source records are translated, try distinguishing between null (or unknown) and zero (0) sale amounts; they convey different meanings to business. Unless business provides an official mapping rule and explicitly states how to translate a NULL value in a particular field to some specific code, the Corepula Method recommends that source data be preserved and NULL values stored as is. All attributes in a Corepula Method schema are linked to some SINCE DATE attribute; storing a NULL value as an attribute value and associating it with a timespan attribute helps to cleanly identify the time period during which the specific value was unknown.

For a more thorough introduction into Corepula Method, please download Corepula Method data modeling whitepaper (PDF)