A workable way to approach data quality management is to define it as the "absence of intolerable defects." Here is a look at some potentially intolerable defects in your environment and what can be done about them.
Data types constrain the values in a column ... to a degree. Any mix of 0-20 characters can go into a character(20) column. However, if the column is a Name field, there are characters you would not expect to find in it, such as % and $. These are "red flags" that the field contains inappropriate data. There are also numerous misspellings and incorrect alternative spellings of last names. Often, a manual review of the column's contents, with a count of each unique value, will bring the one correct spelling to light.
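The manual review described above can be jump-started with a simple profile of the column. A minimal sketch, assuming a small in-memory sample of the Name column (the sample values and the set of "red flag" characters are illustrative assumptions, not from the source):

```python
import re
from collections import Counter

# Hypothetical sample of a character(20) Name column.
names = ["McKnight", "McKnight", "McNight", "Smith%", "$mith", "Smith"]

# Frequency profile: a count of each unique value often makes the
# dominant (correct) spelling stand out from its variants.
profile = Counter(names)
print(profile.most_common())

# Red-flag check: characters you would not expect in a name field.
flagged = [n for n in names if re.search(r"[%$#@0-9]", n)]
print(flagged)
```

In practice the counts would come from a `SELECT name, COUNT(*) ... GROUP BY name` against the source table; the idea is the same.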
There are two approaches to handling the violations, and usually a combination of both is best. The first is to "generalize" the various formatting errors found in the field into rules. Typical formatting errors found in name columns include:
- Space in front of name
- Two spaces between first and last name and/or middle initial
- No period after middle initial
- Inconsistent use of middle initial (sometimes used, sometimes not)
- Use of all caps
- Use of "&" instead of "and" when indicating plurality
- Use of slash instead of hyphen
On and on it goes, especially in environments where original data entry is "free form," unconstrained and without the use of master data as a reference.
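The generalized rules above can be sketched as a small normalization function. This is only an illustration of the approach, assuming the rule set found by profiling matches the list above; a real rule set is driven by what your own data actually contains, and name casing (e.g., McKnight) needs handling beyond the naive title-casing shown here:

```python
import re

def normalize_name(raw: str) -> str:
    """Apply generalized formatting rules to a name value (a sketch)."""
    name = raw.strip()                   # leading/trailing spaces
    name = re.sub(r"\s{2,}", " ", name)  # collapse multiple spaces
    # Add a period after a lone middle initial that lacks one.
    name = re.sub(r"\b([A-Za-z])\b(?!\.)", r"\1.", name)
    name = name.replace("&", "and")      # "&" -> "and" for plurality
    name = name.replace("/", "-")        # slash -> hyphen
    if name.isupper():                   # undo all caps (naive)
        name = name.title()
    return name

print(normalize_name("  WILLIAM  B SMITH"))  # -> "William B. Smith"
```

Each rule here is deterministic, which is what makes it a candidate for generalization in the first place.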
It is not possible to generalize things like use of initials and misspellings (e.g., William McNight instead of William McKnight) into rules, so they need to be handled separately. You can map the incorrect data to the correct data in your data warehouse's staging area. As new data is discovered (e.g., is Bill McKnight the same as William McKnight?), it is held out until review, after which it can be mapped as incorrect or correct data and re-routed through the ETL process.
If adopting this approach, be sure that, procedurally, the reviews are held quickly, because data will be held out of the data warehouse until it is accounted for. The mapping takes place after the rules are applied.
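The mapping-and-hold-out step can be sketched as follows. The mapping table, master reference set, and sample values are all hypothetical; in a real staging area these would be database tables maintained through the review process:

```python
# Reviewed mappings: known misspellings -> corrected value.
name_map = {"William McNight": "William McKnight"}
# Master reference of values already confirmed correct.
known_correct = {"William McKnight"}

held_out = []  # queue of values awaiting manual review

def resolve(value):
    """Return the corrected value, or None if it must be held out."""
    if value in known_correct:
        return value
    if value in name_map:
        return name_map[value]
    held_out.append(value)  # not loaded until reviewed
    return None

print(resolve("William McNight"))  # -> "William McKnight"
print(resolve("Bill McKnight"))    # -> None; queued in held_out
```

After review, a held-out value is either added to `name_map` (if incorrect) or to `known_correct` (if a legitimate new value) and re-routed through the ETL process.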
With either or both approaches, I recommend bringing the "bad" value over as well, since users will often want to know what the source data actually contained.
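Carrying the source value forward is just a matter of landing two columns instead of one. A minimal sketch (the column names `name` and `name_source` are assumptions):

```python
def to_staging_row(raw: str, cleaned: str) -> dict:
    # Keep both the cleaned value and the original ("bad") source value.
    return {"name": cleaned, "name_source": raw}

row = to_staging_row("William McNight", "William McKnight")
print(row)
```

Users query `name` for reporting and consult `name_source` when they need to see what the source system actually held.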
Read part two of this tip.
For more information, check out this Learning Guide for Data Quality.