One of our clients is researching a new initiative where key master data is going to be sourced from multiple systems, used in the data warehouse and also used by other systems. The ETL tool vendors all seem to be releasing data quality and master data management tools. Given that WhereScape is a data warehouse software company I think they were expecting us to make a land grab and make a "do it all in the data warehouse / do it all in WhereScape RED" type recomendation.
One of the things we provided them was a sidebar from our new prototype and iterate while paper. It is more a position on data quality, but does also cover my view on master data management. I have included the extract (slightly edited) here:
Data Quality And Data Noise
Today, with the rise of master data management and other corporate data quality disciplines, the mentality that data quality is a problem to be solved as part of a data warehousing project is in decline, and the integration of data quality facilities into ETL processes is viewed - as we believe it should be - with increasing amounts of skepticism.Nevertheless, during prototype and iterate sessions data warehousing practitioners frequently uncover dirty data - often where data is presumed to be neat and clean - and still struggle with the normal, but dangerous, tendency to want to solve data quality problems as part of the data warehousing project.Experienced WhereScape practitioners report that a few simple rules are usually sufficient to keep data warehousing prototypes from veering into master data management or data governance areas:
• Data quality problems should be solved at the source: the poorly-designed transactional system that introduces the data noise, or the broken business process that the transactional system in question supports.
• Data warehouses should always reconcile to their transactional sources exactly, to promote user confidence and ensure that there is no decision gap between user communities using the data warehouse and those using operational reports from the transactional system (which persist, despite efforts to the contrary).
• Cleansing and normalization of business dimensions is a master data management problem, not a data warehousing problem. Data warehouses that include dimensional values that are “regularized” or “cleansed” may be functioning as de facto master data management repositories, and that is not their role. If master data management discipline is required, projects to implement master data management should be undertaken separately from the data warehousing project(s) at hand.
• Data warehouses and marts are downstream from master data management repositories, not upstream from them. That is, warehouses and marts make use of canonical dimensions in MDM repositories; they do not supply those canonical dimensions.
• Users are the best judge of acceptable levels of data noise. Except in those relatively rare situations in which a data warehousing or data marting project covers a subject area or data set with which the end-user constituency has no prior familiarity, end-users are already compensating for dirty data in their analyses, and will continue to do so without much complaint in most cases.
Clarifying project objectives with project sponsors is of course critical. In the view of experienced WhereScape practitioners, data warehousing projects are better used to highlight issues with data quality than they are to solve those issues, and knowing whether project sponsors intend to spend project budget to resolve transactional source system hygiene - as well as knowing whether end-users are prepared to live with ‘noisy data’ while those source system problems are fixed - is critically important to overall project success.