In-Depth

Delivering high-quality data

As we enter the 21st century, we are at the dawn of the Information Age. Data and information are now as vital to an organization's well-being and future success as oxygen is to humans. And without a fresh supply of clean, unpolluted data, companies will struggle to survive, let alone thrive.

TDWI estimates that data-quality problems cost U.S. businesses more than $600 billion each year. Yet most executives are oblivious to the data quality lacerations that are slowly bleeding their companies to death. More injurious than unnecessary printing, postage and staffing costs is the slow but steady erosion of an organization's credibility among customers and suppliers, and its inability to make sound decisions based on accurate information.

The problem with data is that its quality quickly degenerates over time. Experts say 2% of records in a customer file become obsolete in one month because customers die, divorce, marry and move. In addition, data entry errors, systems migrations and changes to source systems, among other things, generate bucketloads of errors. More perniciously, as organizations fragment into different divisions and units, interpretations of data elements mutate to meet local business needs. A data element that one individual finds valuable may be nonsense to an individual in a different group.

Fortunately, new strategic initiatives such as CRM, business intelligence and supply-chain management are sounding a wake-up call to top executives. Many are learning the hard way that data-quality problems can sabotage the best-laid strategies and expose errors to a much broader, and more critical, external audience.

The good news is that achieving high-quality data is not beyond the means of any company. The keys are to treat data as a strategic corporate resource; develop a program for managing data quality with a commitment from the top; and hire, train or outsource experienced data-quality professionals to oversee and carry out the program. Then, it is critical for organizations to sustain a commitment to managing data quality over time and adjust monitoring and cleansing processes to changes in the business and underlying systems.

Commercial data-quality tools and service bureaus automate the process of auditing, cleaning and monitoring data, and can be well worth the investment. Most commercial tools are now moving beyond auditing and scrubbing name and address data to tackle other data types. They are also beginning to step up to the challenge of validating company-specific business rules, and augmenting addresses with geo-spatial and demographic data, among other things.

Data is a vital resource. Companies that invest in managing it in proportion to its value will stand a better chance of succeeding in today's competitive global economy than those that squander it by neglecting to ensure adequate levels of quality.

Ensuring high-quality data requires a methodology that keeps data consistently accurate regardless of where it resides. This, in turn, requires a universal corporate understanding of the need for 'clean' data and a broad commitment to following the data-quality management framework. A comprehensive explanation of the framework follows.

Step 1: Launch a data-quality program
The first step toward high-quality data is to get managers to admit the problem and take responsibility for it.

''A lot of executives talk about data quality, but few do anything about it,'' said Jim Lair, chairman of the Center for Data Quality in Reston, Va. ''Only a disaster gets people to address data-quality issues.''

Today, most companies delegate authority for managing data quality to the IT department. Although IT must be involved in the process, it does not have the clout to change business processes or behavior that can substantially improve data quality.

To succeed, a data-quality program must be initiated by the CEO, overseen by the board of directors, and managed either by a chief data-quality officer or senior-level business managers in each area of the business.

The best way to kick-start a data-quality initiative is to fold it into a corporate data stewardship or data administration program. These programs are typically chartered to establish and maintain consistent data definitions and business rules so the company can achieve a ''single version of the truth'' and save time on developing new applications and looking for data.

A corporate stewardship committee needs to develop a master plan for data quality that contains a mission statement, objectives and goals. It then needs to educate all employees about the plan.

The corporate stewardship committee also needs to oversee and provide direction to all data-quality teams or functions scattered throughout the company. Specifically, the committee should:
* Provide education on the importance of data quality to the company.
* Communicate data-quality improvements.
* Define mandatory data elements to be measured.
* Review metrics for measuring quality.
* Define methods of reporting on data-quality levels.
* Set precedents for establishing service-level agreements for data quality.
* Establish the ''owners'' and ''custodians'' of each major subject area and data store.
* Resolve cross-divisional data-quality issues.
* Ensure code sets are updated regularly.

Step 2: Develop a project plan
The next step is to develop a data-quality project plan that defines the scope of activity, sets goals, estimates ROI, performs a gap analysis, identifies actions, and measures and monitors success. The team will need to dig into the data to assess its current state, define corrective actions and establish metrics for monitoring conformance to goals.

After a preliminary data assessment, a company should have an understanding of high-priority areas. A scope document identifies key data elements to be analyzed, measured for validity, cleaned and monitored.

A plan should have a well-defined but realistic goal. It should also estimate the ROI and payback period for each project, and define how data will be corrected. Programs should be instituted to monitor key data elements.

Step 3: Build a data-quality team
Organizations must assign or hire individuals to create the plan, perform initial assessment, scrub the data and set up monitoring systems to maintain adequate data quality.

Although an organization may not be large enough to dedicate individuals to each position, it is important that someone assume these responsibilities. In small or mid-sized organizations, one person may take on multiple roles.

Most successful data warehousing programs hire one or two full-time, data-quality analysts who serve on the front lines. These analysts monitor data-quality levels, analyze the source of problems and recommend corrective actions to the oversight committee. They may report to a data or information quality leader or, more likely, the head of data warehousing or data administration.

Ideally, data-quality analysts are recruited from an operational role on the business side. They should be familiar with transaction processing, and understand data fields and business rules.

Steps 4 and 5: Review business processes and data architecture
Once there is corporate backing, the stewardship committee needs to review the business processes for collecting, recording and using data in the areas defined by the scope document. With help from consultants, the team must evaluate the underlying architecture that supports the business practices and information flows.

The business and technical reviews can take several days or weeks, depending on the scale and scope of the data-quality plan and the number of stakeholders.

The team needs to document how data elements are collected and where they are stored. The team also needs to identify who owns the data, who uses the data, what kind of reports are needed and what quality checks exist.

The business process review should expose processes that need to be modified and suggest metrics for measuring data quality, said Elizabeth Vannan, project manager at the Centre for Education Information in Victoria, British Columbia.

At the same time, the technical review can reveal whether there are opportunities to re-architect or replace systems to improve data quality and optimize data collection. These changes may also be needed to pave the way for new strategic applications.

A good technical architecture will do a number of things to enhance data quality: implement validation routines, ETL and data-quality tools, and data-quality checks; consolidate data collection points and shared data; minimize system interfaces; and implement a data integration hub and a meta data repository.
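
As a minimal illustration of what database-level validation routines and data-quality checks can look like, the following sketch uses generic SQL; the table, columns and code table are hypothetical, and exact syntax varies by database.

    -- Hypothetical customer table with declarative data-quality checks.
    CREATE TABLE customer (
        customer_id   INTEGER     NOT NULL PRIMARY KEY,
        last_name     VARCHAR(60) NOT NULL,
        birth_date    DATE,
        status_code   CHAR(2)     NOT NULL
                      REFERENCES customer_status_code (status_code), -- valid codes only
        postal_code   VARCHAR(10),
        -- Reject obviously impossible values at the point of entry.
        CONSTRAINT chk_birth_date CHECK (birth_date > DATE '1880-01-01')
    );

Declaring such rules in the database itself means every application that writes to the table is held to the same standard, rather than relying on each data-entry screen to enforce them.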

Step 6: Assess data quality
After reviewing processes and architectures, a thorough assessment of data quality in key subject areas is required.

The assessment will identify common data defects, create metrics to detect defects as they enter the data warehouse, and create rules or recommend actions for fixing the data. This can be long, labor-intensive work.

Data warehousing consultants have seen too many companies fall victim to the ''code, load and explode'' phenomenon. These organizations skip a rigorous analysis of their source data and go straight to coding extract and transformation programs. When they load the data into the warehouse, the load process spits out huge numbers of errors, making the warehouse virtually unusable.

Consequently, many consultants recommend that companies minimize the number of source systems when building their first data warehouse. This technique minimizes project delays and shortens the process of auditing and fixing source data.

The assessment systematically reviews all data elements and identifies problems like missing data, incorrect values, duplicate records and business rule violations.

The results can be shocking, if not depressing. Consider the following data anomalies discovered by the Center for Data Quality in audits performed for clients:
* Required social security number missing from 82% of claims.
* Blank fields in 30% of a million records from a securities firm.
* Customer status missing in 47% of databases.
* A database contains 1,100 organization codes but only 18 are valid.
* About 82% of loans are calculated on the low side.
* Duplicate payroll numbers.

Or consider these defects that the British Columbia Ministry of Advanced Education discovered:
* Students over 2,000 years old or not yet born.
* Course sections that occurred before the college was established.
* Course sections that ended before they started.
* Students registered in the same course section, multiple times.
* Invalid program, course section and course codes.

There are several ways to audit existing data files. Companies can issue SQL queries against a sample of data, use a commercial data-profiling tool or send a data sample to a service bureau for evaluation.

Of the three options, most organizations prefer to perform the work in-house using SQL, according to a TDWI survey. One drawback to this approach is that it limits queries to known or existing conditions.
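
As an illustration, such audit queries are straightforward to write, although they only find the conditions an analyst thinks to test for. The tables and columns below are hypothetical; a real audit runs dozens of checks like these against a data sample.

    -- Count records missing a required field.
    SELECT COUNT(*) AS missing_ssn
    FROM   claim
    WHERE  ssn IS NULL OR ssn = '';

    -- Find values that fall outside the valid code set.
    SELECT org_code, COUNT(*) AS occurrences
    FROM   account
    WHERE  org_code NOT IN (SELECT org_code FROM valid_org_code)
    GROUP  BY org_code;

    -- Flag likely duplicate customer records.
    SELECT last_name, postal_code, birth_date, COUNT(*) AS copies
    FROM   customer
    GROUP  BY last_name, postal_code, birth_date
    HAVING COUNT(*) > 1;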

On the other hand, commercial data-auditing tools use pattern recognition and classification techniques to dig deeper into the tables. Dependency profiling also enables users to reverse-engineer the source system data models, which can facilitate data warehouse design.

The last option -- outsourcing sample data to a service bureau -- is not popular yet, according to our survey. But it can be a good alternative if you need to:
* Kick-start a data-quality initiative.
* Perform an audit without in-house resources.
* Verify against third-party databases.
* Adhere to a company audit policy.
* Augment existing cleansing processes.
* Tap data-quality management expertise and experience that do not exist in-house.

Peter Harvey, CEO at Intellidyn, a marketing analytics firm in New York, said that when his firm audited clients' recently ''cleaned'' customer files, it found that 5% of the records were duplicates. The duplication rate for untouched customer files can be 20% or more.

Harvey said service bureaus should be able to demonstrate a huge ROI and immediate (three-to-six-month) payback on the fees they charge.

Most service bureaus parse and match data, but they also need to correct errors and fill in missing data, explained Harvey. This is vital if the client (or service bureau) needs to model customer behavior using data mining or other analytical tools.

The key to any data-auditing approach is working closely with experts who understand both the business and the data. These experts need to study audit reports and determine which 'anomalies' represent defects and which are valid.

Subject-matter experts then need to define business rules for cleaning defects. These rules can include mapping codes, assigning default values to missing fields or standardizing fields against a reference. They can also recommend whether to fix data at the source, the staging area or in the data warehouse, or whether to change business processes to prevent defects from occurring in the first place.

They should then suggest metrics for measuring the quality of the data on an ongoing basis. Potential metrics include (a few illustrative checks appear after this list):
* Acceptable number of invoices or claims.
* Proper syntax on various fields, like name and date.
* Counts on unique identifiers, such as customer IDs.
* Acceptable date ranges for data elements.
* Formulas for calculating loan amounts.
* Rules defining constraints between fields.
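
A rough sketch of how such metrics translate into repeatable checks, using generic SQL and hypothetical table names:

    -- Dates outside the acceptable range (e.g., students not yet born).
    SELECT COUNT(*) AS bad_birth_dates
    FROM   student
    WHERE  birth_date < DATE '1900-01-01'
       OR  birth_date > CURRENT_DATE;

    -- Unique-identifier counts: total rows and distinct IDs should match.
    SELECT COUNT(*) AS total_rows,
           COUNT(DISTINCT customer_id) AS unique_ids
    FROM   customer;

    -- Cross-field constraint: a course section must end after it starts.
    SELECT COUNT(*) AS impossible_sections
    FROM   course_section
    WHERE  end_date < start_date;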

Step 7: Clean the data
The job of cleaning the data begins at the close of the audit. A fundamental principle is to detect and fix defects as close as possible to the source to minimize costs.

Prevention is the least-costly response to defects, followed by correction and repair. Correction involves fixing defects in-house, while repair involves fixing defects that affect customers directly.

Defect prevention programs may cost significant money, but pay bigger dividends in the long run.

There are four basic methods for ''cleaning'' data:
1. Correct. Most cleansing operations involve fixing both defective data elements and records. Correcting data elements typically requires you to modify an existing incorrect value (fix a misspelling); modify a correct value to make it conform to a corporate or industry standard (substitute ''Mr.'' for ''Mister''); or replace a missing value.

Correcting records typically requires you to match and merge duplicate records that exist in the same file or multiple files, and decouple incorrectly merged records. Decoupling is required when a single record contains data describing two or more entities.

To correct data in a relational database, analysts use SQL or a commercial data-quality tool with SQL support; to correct defects in non-SQL databases, they use the native data manipulation language; and to correct data during extraction or loading, they use ETL tools. (A brief SQL sketch of these cleansing operations follows this list.)

2. Filter. Filtering involves deleting duplicate, missing or nonsensical data elements. Use caution when filtering data because it may create data integrity problems.

3. Detect and report. In some cases, you may not want to change defective data because it is not cost-effective or possible. In these cases, analysts need to notify users and document the condition in meta data.

4. Prevent. Prevention involves educating data-entry people, changing or applying new validations to operational systems, updating outdated codes, redesigning systems and models, or changing rules and processes.
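
The following sketch shows what a few of these cleansing operations can look like in generic SQL; the tables, columns and codes are hypothetical, and a commercial cleansing tool or ETL job would typically perform the equivalent work.

    -- Correct: standardize a value to the corporate standard.
    UPDATE customer
    SET    title = 'Mr.'
    WHERE  title = 'Mister';

    -- Correct: supply a default for a missing value, using a placeholder code.
    UPDATE customer
    SET    status_code = 'UN'
    WHERE  status_code IS NULL;

    -- Filter: remove duplicate rows, keeping the lowest surrogate key.
    -- Use with caution; deleting rows can create integrity problems.
    DELETE FROM customer
    WHERE  customer_id NOT IN (
               SELECT MIN(customer_id)
               FROM   customer
               GROUP  BY last_name, postal_code, birth_date);

    -- Prevent: add a validation so the defect cannot recur.
    ALTER TABLE customer
        ADD CONSTRAINT chk_title CHECK (title IN ('Mr.', 'Mrs.', 'Ms.', 'Dr.'));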

If defects cannot be prevented, the best place to clean data is in the source system so the defects cannot spread to the data warehouse and other downstream systems. Most respondents to the TDWI survey fix data at the source.

Fixing errors at the source is important when totals in the data warehouse must reconcile exactly with data in the source systems. This is typical with financial apps.

''If you can't reconcile totals, the data warehouse will face a credibility problem, even if its data is more accurate,'' noted Darrell Piatt, a consultant at Kaloke Technologies, Dover, Del. ''There is a real problem with having too much quality.''

Besides cleaning data at the source, organizations can fix data in other places:
* Staging area -- Fixing defective data in a staging area prevents errors from source systems, external data files or the extraction process itself from entering the warehouse, and it reduces the workload on ETL processes (see the sketch after this list).
* ETL process -- ETL tools can perform some, but not all, data-cleansing operations. However, they can be combined or, in some cases, integrated with data-cleansing tools to provide a more complete solution.
* Data warehouse -- Since ETL processes can introduce errors, one must audit and clean data in the data warehouse before allowing access. However, cleaning data in the data warehouse shifts the burden from the data creators to the data users. ''If an error exists in the data warehouse, it has already impacted the business,'' said Celia Fuller, director of the Corporate Data Warehouse at BlueCross BlueShield of North Carolina.
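
For example, a staging-area check might log suspect rows and hold them back from the load, along the lines of the following generic SQL (the staging table, error table and columns are hypothetical):

    -- Log staged rows that violate a basic rule...
    INSERT INTO stage_customer_error (customer_id, load_batch_id, error_reason)
    SELECT customer_id, load_batch_id, 'missing or invalid status code'
    FROM   stage_customer
    WHERE  status_code IS NULL
       OR  status_code NOT IN (SELECT status_code FROM customer_status_code);

    -- ...and exclude them when loading the warehouse table.
    INSERT INTO dw_customer (customer_id, last_name, status_code)
    SELECT s.customer_id, s.last_name, s.status_code
    FROM   stage_customer s
    WHERE  NOT EXISTS (SELECT 1
                       FROM   stage_customer_error e
                       WHERE  e.customer_id = s.customer_id
                         AND  e.load_batch_id = s.load_batch_id);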

Step 8: Improve business practices
''A data-quality problem is a symptom of the need for change in the current process,'' said Brad Bergh, a database designer with Double Star Inc. Improving established processes often stokes political and cultural fires, but the payoff can be great.

Under the auspices of the CEO and the direction of corporate data stewards, a company can begin fundamental changes in how it does business.

There are several ways to improve business processes and practices:
Educate. Use newsletters, corporate memos, the news media, and employee and shareholder meetings to communicate the importance of data to the company.

Train and reward. Continuously train new and existing data-entry people about data standards. More importantly, reward them for improving data quality. And make sure existing incentives do not undermine quality.

Drive business impact close to home. Source system owners are not motivated to change until data defects materially affect their business. Show business units how data-entry practices can cost money and business.

Apply validations. Prevent errors by building stringent validation checks on data-entry fields.

Standardize codes, rules and definitions. Bring together senior managers to agree on standard codes, data definitions and business rules.

Leverage successes. Some groups will resist standardization, sometimes for legitimate reasons. Tackle easy projects first, and leverage these successes to put pressure on holdouts.

Redesign systems. Redesign data models to more accurately reflect the business.

Step 9: Monitor data continuously
It is very time-consuming to prepare data files when loading a database for the first time. But organizations can lose the benefits of these data preparation efforts quickly if they fail to monitor data quality continuously.

Companies need to build a program that audits data at regular intervals, or just before or after data is loaded into another system such as a data warehouse. Companies then use the audit reports to measure their progress in achieving data-quality goals and complying with service-level agreements.

Service-level agreements can help boost confidence in the data warehouse or source systems. For example, some organizations do not allow knowledge workers to access the data warehouse until the team meets data-quality levels specified in a service-level agreement.

If thresholds are exceeded, data-quality analysts need to examine the data in question and develop a plan for remedying the situation.
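
A monitoring job can be as simple as a scheduled query that measures a key metric after each load and records it for comparison against the service-level agreement. The following generic SQL is a sketch; the warehouse table, the metric and the history table are hypothetical.

    -- Percentage of warehouse customer records missing a status code.
    SELECT 100.0 * SUM(CASE WHEN status_code IS NULL THEN 1 ELSE 0 END)
                 / COUNT(*) AS pct_missing_status
    FROM   dw_customer;

    -- Record the measurement so trends can be reported against the SLA.
    INSERT INTO dq_metric_history (metric_name, measured_on, metric_value)
    SELECT 'pct_missing_status',
           CURRENT_DATE,
           100.0 * SUM(CASE WHEN status_code IS NULL THEN 1 ELSE 0 END) / COUNT(*)
    FROM   dw_customer;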

Implementing such a program is not easy for any corporation, but it is nonetheless vital for competing in the information economy of the 21st century. Once the program is in place, it becomes much simpler to sustain an ongoing data-quality effort that guards against the steady deterioration of data over time.

Such investments in data quality are vital even in difficult economic times -- probably even more so. The costs of ''dirty'' data can undermine organizations in any economic climate. The rewards of clean data can provide a clear edge over competitors.
