In-Depth
Delivering high-quality data
By Wayne W. Eckerson
October 3, 2002
As we enter the 21st century, we are at the dawn of the Information Age. Data
and information are now as vital to an organization's well-being and future
success as oxygen is to humans. And without a fresh supply of clean, unpolluted
data, companies will struggle to survive and thrive.
TDWI estimates that data-quality problems
cost U.S. businesses more than $600 billion each year. Yet most executives are
oblivious to the data quality lacerations that are slowly bleeding their
companies to death. More injurious than unnecessary printing, postage and
staffing costs is the slow but steady erosion of an organization's credibility
among customers and suppliers, and its inability to make sound decisions based
on accurate information.
The problem with data is that its quality quickly degenerates over time.
Experts say 2% of records in a customer file become obsolete in one month
because customers die, divorce, marry and move. In addition, data entry errors,
systems migrations and changes to source systems, among other things, generate
bucketloads of errors. More perniciously, as organizations fragment into
different divisions and units, interpretations of data elements mutate to meet
local business needs. A data element that one individual finds valuable may be
nonsense to an individual in a different group.
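If that 2% monthly attrition compounds, roughly a fifth of the records in a customer file (1 - 0.98^12, or about 21.5%) will be out of date within a year -- before data entry errors and system changes are even counted.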
Fortunately, new strategic initiatives such as CRM, business intelligence
and supply-chain management are sounding a wake-up call to top executives. Many
are learning the hard way that data-quality problems can sabotage the best-laid
strategies and expose errors to a much broader, and more critical, external
audience.
The good news is that achieving high-quality data is not beyond the means of
any company. The keys are to treat data as a strategic corporate resource;
develop a program for managing data quality with a commitment from the top; and
hire, train or outsource experienced data-quality professionals to oversee and
carry out the program. Then, it is critical for organizations to sustain a
commitment to managing data quality over time and adjust monitoring and
cleansing processes to changes in the business and underlying systems.
Commercial data-quality tools and service bureaus automate the process of
auditing, cleaning and monitoring data, and can be well worth the investment.
Most commercial tools are now moving beyond auditing and scrubbing name and
address data to tackle other data types. They are also beginning to step up to
the challenge of validating company-specific business rules, and augmenting
addresses with geo-spatial and demographic data, among other things.
Data is a vital resource. Companies that invest in managing this resource in
proportion to its value will stand a better chance of succeeding in today's
competitive global economy than those that squander it by neglecting to ensure
adequate levels of quality.
Ensuring high-quality data requires the implementation of a methodology that
lays a foundation for ensuring that data remains at a consistently high quality
regardless of its location. This requires a universal corporate understanding of
the need for 'clean' data and a broad commitment to following the data-quality
management framework. A comprehensive explanation of the framework follows.
Step 1: Launch a data-quality program
The first step toward high-quality data is to get managers to admit the problem and take responsibility for it.
''A lot of executives talk about data quality, but few do anything about it,''
said Jim Lair, chairman of the Center for Data Quality in Reston, Va. ''Only a
disaster gets people to address data-quality issues.''
Today, most companies delegate authority for managing data quality to the IT
department. Although IT must be involved in the process, it does not have the
clout to change business processes or behavior that can substantially improve
data quality.
To succeed, a data-quality program must be initiated by the CEO, overseen by
the board of directors, and managed either by a chief data-quality officer or
senior-level business managers in each area of the business.
The best way to kick-start a data-quality initiative is to fold it into a
corporate data stewardship or data administration program. These programs are
typically chartered to establish and maintain consistent data definitions and
business rules so the company can achieve a ''single version of the truth'' and
save time on developing new applications and looking for data.
A corporate stewardship committee needs to develop a master plan for data
quality that contains a mission statement, objectives and goals. It then needs
to educate all employees about the plan.
The corporate stewardship committee also needs to oversee and provide
direction to all data-quality teams or functions scattered throughout the
company. Specifically, the committee should:
* Provide education on the importance of data quality to the company.
* Communicate data-quality improvements.
* Define mandatory data elements to be measured.
* Review metrics for measuring quality.
* Define methods of reporting on data-quality levels.
* Set precedence for establishing service-level agreements for data quality.
* Establish the ''owners'' and ''custodians'' of each major subject area and data store.
* Resolve cross-divisional data-quality issues.
* Ensure code sets are updated regularly.
Step 2: Develop a project plan
The next step is to develop a data-quality
project plan that defines the scope of activity, sets goals, estimates ROI,
performs a gap analysis, identifies actions, and measures and monitors success.
The team will need to dig into the data to assess its current state, define
corrective actions and establish metrics for monitoring conformance to
goals.
After a preliminary data assessment, a company should have an understanding
of high-priority areas. A scope document identifies key data elements to be
analyzed, measured for validity, cleaned and monitored.
A plan should have a well-defined but realistic goal. It should also estimate
the ROI and payback period for each project, and define how data will be
corrected. Programs should be instituted to monitor key data elements.
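As a rough illustration of the payback arithmetic such a plan calls for, the sketch below uses purely hypothetical cost and benefit figures; nothing in it is drawn from the article.

    # Hypothetical figures for illustration only: a cleansing project that costs
    # $250,000 and recovers $40,000 per month in avoided rework and mailing waste.
    project_cost = 250_000.0
    monthly_benefit = 40_000.0
    annual_benefit = monthly_benefit * 12

    first_year_roi = (annual_benefit - project_cost) / project_cost * 100
    payback_months = project_cost / monthly_benefit

    print(f"First-year ROI: {first_year_roi:.0f}%")        # 92%
    print(f"Payback period: {payback_months:.2f} months")  # 6.25 months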
Step 3: Build a data-quality team
Organizations must assign or hire
individuals to create the plan, perform initial assessment, scrub the data and
set up monitoring systems to maintain adequate data quality.
Although an organization may not be large enough to dedicate individuals to
each position, it is important that someone assume these responsibilities. In
small- or mid-sized organizations, one person may assume responsibility for
several of these roles.
Most successful data warehousing programs hire one or two full-time,
data-quality analysts who serve on the front lines. These analysts monitor
data-quality levels, analyze the source of problems and recommend corrective
actions to the oversight committee. They may report to a data or information
quality leader or, more likely, the head of data warehousing or data
administration.
Ideally, data-quality analysts are recruited from an operational role on the
business side. They should be familiar with transaction processing, and
understand data fields and business rules.
Steps 4 and 5: Review business processes and data architecture
Once there is corporate backing, the stewardship committee needs to review the business
processes for collecting, recording and using data in the areas defined by the
scope document. With help from consultants, the team must evaluate the
underlying architecture that supports the business practices and information
flows.
The business and technical reviews can take several days or weeks, depending
on the scale and scope of the data-quality plan and the number of
stakeholders.
The team needs to document how data elements are collected and where they are
stored. The team also needs to identify who owns the data, who uses the data,
what kind of reports are needed and what quality checks exist.
The business process review should expose processes that need to be modified
and suggest metrics for measuring data quality, said Elizabeth Vannan, project
manager at the Centre for Education Information in Victoria, British
Columbia.
At the same time, the technical review can reveal whether there are
opportunities to re-architect or replace systems to improve data quality and
optimize data collection. These changes may also be needed to pave the way for
new strategic applications.
A good technical architecture will do a number of things to enhance data
quality: implement validation routines, ETL and data-quality tools, and
data-quality checks; consolidate data collection points and shared data;
minimize system interfaces; and implement a data integration hub and a meta data
repository.
Step 6: Assess data quality
After reviewing processes and architectures, a
thorough assessment of data quality in key subject areas is required.
The assessment will identify common data defects, create metrics to detect
defects as they enter the data warehouse, and create rules or recommend actions
for fixing the data. This can be long, labor-intensive work.
Data warehousing consultants have seen too many companies fall victim to the
''code, load and explode'' phenomenon. These organizations skip a rigorous
analysis of the source data and go straight to coding extract and transform
programs. When they load data into the warehouse, it spits out huge numbers of
errors, making the warehouse virtually unusable.
Consequently, many consultants recommend that companies minimize the number
of source systems when building their first data warehouse. This technique
minimizes project delays and shortens the process of auditing and fixing source
data.
The assessment systematically reviews all data elements and identifies
problems like missing data, incorrect values, duplicate records and business
rule violations.
The results can be shocking, if not depressing. Consider the following data
anomalies discovered by the Center for Data Quality in audits performed for
clients:
* Required social security number missing from 82% of claims.
* Blank fields in 30% of a million records from a securities firm.
* Customer status missing in 47% of databases.
* A database contains 1,100 organization codes but only 18 are valid.
* About 82% of loans are calculated on the low side.
* Duplicate payroll numbers.
Or consider these defects that the British Columbia Ministry of Advanced
Education discovered:
* Students over 2,000 years old or not yet born.
* Course sections that occurred before the college was established.
* Course sections that ended before they started.
* Students registered in the same course section, multiple times.
* Invalid program, course section and course codes.
There are several ways to audit existing data files. Companies can issue SQL
queries against a sample of data, use a commercial data-profiling tool or send a
data sample to a service bureau for evaluation.
Of the three options, most organizations prefer to perform the work in-house
using SQL, according to a TDWI survey. One drawback to this approach is that it
limits queries to known or existing conditions.
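As a simple illustration of the in-house approach, an analyst might run hand-written checks along the following lines. The database file, table and column names are hypothetical, and the queries only test conditions the analyst already suspects -- which is exactly the drawback noted above.

    # A minimal sketch of an in-house SQL audit using Python's built-in sqlite3
    # module. The database file, table and columns are hypothetical.
    import sqlite3

    conn = sqlite3.connect("warehouse.db")
    checks = {
        "total rows":     "SELECT COUNT(*) FROM customer",
        "missing SSN":    "SELECT COUNT(*) FROM customer WHERE ssn IS NULL OR ssn = ''",
        "missing status": "SELECT COUNT(*) FROM customer WHERE status IS NULL",
        "duplicate SSNs": "SELECT COUNT(*) FROM (SELECT ssn FROM customer "
                          "GROUP BY ssn HAVING COUNT(*) > 1)",
    }
    for name, sql in checks.items():
        (count,) = conn.execute(sql).fetchone()
        print(f"{name}: {count}")
    conn.close()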
On the other hand, commercial data-auditing tools use pattern recognition and
classification techniques to dig deeper into the tables. Dependency profiling
also enables users to reverse-engineer the source system data models, which can
facilitate data warehouse design.
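The pattern-recognition idea behind these profiling tools can be sketched roughly as follows: collapse every value into a character pattern and count the patterns, so unexpected formats surface on their own. This is only a toy version of what commercial tools do, and the sample values are invented.

    # A toy version of the pattern-recognition technique profiling tools use:
    # collapse each value into a character pattern (letter -> A, digit -> 9)
    # and count the patterns so unexpected formats stand out.
    from collections import Counter

    def pattern(value: str) -> str:
        return "".join("A" if c.isalpha() else "9" if c.isdigit() else c for c in value)

    phone_numbers = ["617-555-0101", "617-555-0199", "5550123", "n/a"]  # sample values
    print(Counter(pattern(v) for v in phone_numbers))
    # Counter({'999-999-9999': 2, '9999999': 1, 'A/A': 1})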
The last option -- outsourcing sample data to a service bureau -- is not
popular yet, according to our survey. But it can be a good alternative if you
need to:
* Kick-start a data-quality initiative.
* Perform an audit without in-house resources.
* Verify against third-party databases.
* Adhere to a company audit policy.
* Augment existing cleansing processes.
* Draw on data-quality management expertise and experience that do not exist in-house.
Peter Harvey, CEO at Intellidyn, a marketing analytics
firm in New York, said that when his firm recently audited ''cleaned'' customer
files from clients, it found that 5% of the records were duplicates. The
duplication rate for untouched customer files can be 20% or more.
Harvey said service bureaus should be able to demonstrate a
huge ROI and immediate (three-to-six-month) payback on the fees they
charge.
Most service bureaus parse and match data, but they also need to
correct errors and fill in missing data, explained Harvey. This is vital if the
client (or service bureau) needs to model customer behavior using data mining or
other analytical tools.
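A service bureau's matching engines are far more sophisticated, but the core idea can be sketched as normalizing each record into a match key and grouping records that share it. The records and fields below are invented for illustration.

    # A toy illustration of duplicate matching: normalize each name into a match
    # key and group records that share it. Real matching engines add parsing,
    # phonetic and probabilistic techniques.
    from collections import defaultdict

    records = [
        {"id": 1, "name": "John Q. Smith", "zip": "02139"},
        {"id": 2, "name": "Smith, John",   "zip": "02139"},
        {"id": 3, "name": "Jane Doe",      "zip": "10001"},
    ]

    def match_key(rec):
        tokens = rec["name"].replace(",", " ").replace(".", " ").lower().split()
        tokens = sorted(t for t in tokens if len(t) > 1)  # drop single-letter initials
        return (" ".join(tokens), rec["zip"])

    groups = defaultdict(list)
    for rec in records:
        groups[match_key(rec)].append(rec["id"])

    duplicates = {key: ids for key, ids in groups.items() if len(ids) > 1}
    print(duplicates)  # {('john smith', '02139'): [1, 2]}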
The key to any data-auditing approach is working closely with experts who
understand both the business and the data. These experts need to study audit
reports and determine which 'anomalies' represent defects and which are
valid.
Subject-matter experts then need to define business rules for cleaning
defects. These rules can include mapping codes, assigning default values to
missing fields or standardizing fields against a reference. They can also
recommend whether to fix data at the source, the staging area or in the data
warehouse, or whether to change business processes to prevent defects from
occurring in the first place.
They should then suggest metrics for continuously measuring the quality of the
data file. Potential metrics include (a sketch of a few such checks follows the list):
* Acceptable number of invoices or claims.
* Proper syntax on various fields, like name and date.
* Counts on unique identifiers, such as customer IDs.
* Acceptable date ranges for data elements.
* Formulas for calculating loan amounts.
* Rules defining constraints between fields.
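The sketch promised above shows how a few of these metrics might be expressed as automated checks; the field names, formats and thresholds are purely illustrative.

    # A minimal sketch of a few of the metrics above expressed as automated
    # checks. The field names, formats and thresholds are illustrative only.
    import re
    from datetime import date

    def check_claims(rows):
        issues, seen_ids = [], set()
        for row in rows:
            if not re.fullmatch(r"C\d{3}", row["claim_id"]):
                issues.append((row["claim_id"], "claim_id fails syntax rule"))
            if row["claim_id"] in seen_ids:
                issues.append((row["claim_id"], "duplicate claim_id"))
            seen_ids.add(row["claim_id"])
            if not date(1990, 1, 1) <= row["claim_date"] <= date.today():
                issues.append((row["claim_id"], "claim_date outside acceptable range"))
            if row["amount"] <= 0:
                issues.append((row["claim_id"], "amount violates constraint"))
        return issues

    sample = [
        {"claim_id": "C001", "claim_date": date(2002, 5, 1), "amount": 125.00},
        {"claim_id": "C002", "claim_date": date(1899, 1, 1), "amount": -10.00},
    ]
    for claim_id, problem in check_claims(sample):
        print(claim_id, problem)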
Step 7: Clean the data
The job of cleaning the data begins at the close of
the audit. A fundamental principle is to detect and fix defects as close as
possible to the source to minimize costs.
Prevention is the least-costly response to defects, followed by correction
and repair. Correction involves fixing defects in-house, while repair involves
fixing defects that affect customers directly.
Defect prevention programs may cost significant money, but pay bigger
dividends in the long run.
There are four basic methods for ''cleaning'' data:
1. Correct. Most cleansing operations involve fixing both defective data
elements and records. Correcting data elements typically requires you to modify
an existing incorrect value (fix a misspelling); modify a correct value to make
it conform to a corporate or industry standard (substitute ''Mr.'' for
''Mister''); or replace a missing value. (A short SQL sketch of such corrections
follows this list.)
Correcting records typically requires you to match and merge duplicate
records that exist in the same file or multiple files, and decouple incorrectly
merged records. Decoupling is required when a single record contains data
describing two or more entities.
To correct data in a relational database, analysts use SQL or a commercial
data-quality tool with SQL support. To correct defects in non-SQL databases, use
the native data manipulation language. To correct data during extraction or
loading processes, use ETL tools.
2. Filter. Filtering involves deleting duplicate, missing or nonsensical data
elements. Use caution when filtering data because it may create data integrity
problems.
3. Detect and report. In some cases, you may not want to change defective
data because it is not cost-effective or possible. In these cases, analysts need
to notify users and document the condition in meta data.
4. Prevent. Prevention involves educating data-entry people, changing or
applying new validations to operational systems, updating outdated codes,
redesigning systems and models, or changing rules and processes.
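As noted under the ''correct'' method, element-level fixes in a relational database often reduce to straightforward SQL. The sketch below, which runs SQL through Python's built-in sqlite3 module against a hypothetical customer table, standardizes a salutation and supplies a default for a missing value; it is an illustration, not a complete cleansing routine.

    # Illustrative element-level corrections in SQL, run through Python's
    # sqlite3 module. The database file, table and columns are hypothetical.
    import sqlite3

    conn = sqlite3.connect("warehouse.db")
    with conn:  # commits on success, rolls back on error
        conn.execute("UPDATE customer SET salutation = 'Mr.' "
                     "WHERE salutation = 'Mister'")
        conn.execute("UPDATE customer SET status = 'UNKNOWN' "
                     "WHERE status IS NULL OR status = ''")
    conn.close()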
If defects cannot be prevented, the best place to clean data is in the source
system so the defects cannot spread to the data warehouse and other downstream
systems. According to the TDWI survey, most respondents fix data at the source.
Fixing errors at the source is important when totals in the data warehouse
must reconcile exactly with data in the source systems. This is typical with
financial apps.
''If you can't reconcile totals, the data warehouse will face a credibility
problem, even if its data is more accurate,'' noted Darrell Piatt, a consultant
at Kaloke Technologies, Dover, Del. ''There is a real problem with having too
much quality.''
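The reconciliation check itself can be as simple as comparing control totals between the source system and the warehouse before a load is released. The sketch below assumes hypothetical SQLite databases, tables and columns.

    # Compare a control total in a hypothetical source system with the
    # corresponding total in the warehouse before releasing the load.
    import sqlite3

    (src_total,) = sqlite3.connect("source.db").execute(
        "SELECT SUM(amount) FROM gl_transactions").fetchone()
    (dwh_total,) = sqlite3.connect("warehouse.db").execute(
        "SELECT SUM(amount) FROM fact_gl").fetchone()

    if src_total != dwh_total:
        print(f"Reconciliation failed: source {src_total} vs. warehouse {dwh_total}")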
Besides cleaning data at the source, organizations can fix data in other
places:
* Staging area -- Fixing defective data in a staging area prevents errors from source systems, external data files or the extraction process itself from entering the warehouse, and it reduces the workload on ETL processes.
* ETL process -- ETL tools can perform some, but not all, data-cleansing operations. However, they can be combined or, in some cases, integrated with data-cleansing tools to provide a more complete solution.
* Data warehouse -- Since ETL processes can introduce errors, one must audit and clean data in the data warehouse before allowing access. However, cleaning data in the data warehouse shifts the burden from the data creators to the data users. ''If an error exists in the data warehouse, it has already impacted the business,'' said Celia Fuller, director of the Corporate Data Warehouse at BlueCross BlueShield of North Carolina.
Step 8: Improve business practices
''A data-quality problem is a symptom of
the need for change in the current process,'' said Brad Bergh, a database
designer with Double Star Inc. Improving established processes often stokes
political and cultural fires, but the payoff can be great.
Under the auspices of the CEO and the direction of corporate data stewards, a
company can begin fundamental changes in how it does business.
There are several ways to improve business processes and
practices:
Educate. Use newsletters, corporate memos, the news media, and
employee and shareholder meetings to communicate the importance of data to the
company.
Train and reward. Continuously train new and existing data-entry people about
data standards. More importantly, reward them for improving data quality. And
make sure existing incentives do not undermine quality.
Drive business impact close to home. Source system owners are not motivated to
change until data defects materially affect their business. Show business units
how data-entry practices can cost money and business.
Apply validations. Prevent errors by building stringent validation checks on
data-entry fields. (A small sketch of such a check follows this list.)
Standardize codes, rules and definitions. Bring together senior managers to
agree on standard codes, data definitions and business rules.
Leverage successes. Some groups will resist standardization, sometimes for
legitimate reasons. Tackle easy projects first, and leverage these successes to
put pressure on holdouts.
Redesign systems. Redesign data models to more accurately reflect the
business.
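Here is the small validation sketch mentioned above: a routine that rejects a record the moment a field breaks a rule. The field names, code set and formats are hypothetical.

    # Illustrative data-entry validation: flag a record as soon as a field
    # breaks a rule. Field names and rules are hypothetical.
    import re

    def validate_entry(record):
        errors = []
        if not re.fullmatch(r"\d{3}-\d{2}-\d{4}", record.get("ssn", "")):
            errors.append("ssn must look like 999-99-9999")
        if record.get("state", "") not in {"NC", "NY", "VA"}:  # stand-in code set
            errors.append("state code not in the approved code set")
        if not record.get("last_name", "").strip():
            errors.append("last_name is required")
        return errors

    print(validate_entry({"ssn": "123-45-678", "state": "ZZ", "last_name": ""}))
    # ['ssn must look like 999-99-9999',
    #  'state code not in the approved code set',
    #  'last_name is required']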
Step 9: Monitor data continuously
It is very time-consuming to prepare
data files when loading a database for the first time. But organizations can
lose the benefits of these data preparation efforts quickly if they fail to
monitor data quality continuously.
Companies need to build a program that audits data at regular intervals, or
just before or after data is loaded into another system such as a data
warehouse. Companies then use the audit reports to measure their progress in
achieving data-quality goals and complying with service-level agreements.
Service-level agreements can help boost confidence in the data warehouse or
source systems. For example, some organizations do not allow knowledge workers
to access the data warehouse until the team meets data-quality levels specified
in a service-level agreement.
If thresholds are exceeded, data-quality analysts need to examine the data in
question and develop a plan for remedying the situation.
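Operationally, such a threshold check can be a small routine that runs after every load and compares measured defect rates against the agreed service levels. The metric names and thresholds below are invented for illustration.

    # Compare measured defect rates from the latest load against hypothetical
    # service-level thresholds and flag any breaches for the data-quality analyst.
    sla_thresholds = {           # maximum tolerated rate per metric
        "missing_ssn_rate": 0.01,
        "duplicate_rate":   0.005,
    }
    latest_load = {"missing_ssn_rate": 0.03, "duplicate_rate": 0.002}  # example measurements

    breaches = {metric: rate for metric, rate in latest_load.items()
                if rate > sla_thresholds.get(metric, 1.0)}
    for metric, rate in breaches.items():
        print(f"SLA breach: {metric} = {rate:.1%} (limit {sla_thresholds[metric]:.1%})")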
No one can argue that implementing such a program is easy, but it is
nonetheless vital for competing in the information economy of the 21st century.
Once the program is in place, it becomes much simpler to sustain an ongoing
data-quality effort that guards against the steady deterioration of data over
time.
Such investments in data quality are vital even in difficult economic times
-- probably even more so. The costs of ''dirty'' data can undermine organizations
in any economic climate. The rewards of clean data can provide a clear edge over
competitors.