News
Will dirty data always be with us?
By Peter Bochner
April 3, 2002
IT departments are dealing with more data than ever, but it is locked up in
a patchwork of disparate repositories: legacy systems, relational databases,
data warehouses, Web pages, e-mail and the like.
"There is data all over the place," said Denise Draper, chief software
architect at Nimble Technology Inc., a data integration solution provider. "Individual
enterprises sometimes have thousands of places where they store data, and it
is stored in systems that were not designed to work together."
With the onslaught of Web services -- the capability to run integrated
applications across multiple platforms over the Internet -- data integration
problems are going to get a lot deeper. "Web services make the problem
of data integration worse because they're about connecting things across wider
boundaries," noted Draper. "Now you have to integrate your data with
the data of your customers, partners and suppliers."
The semi-structured nature of XML makes it easier for an organization to share
data among applications. Most organizations that are tuned in to XML employ it
as an interchange layer to integrate different applications, or to construct
portals or B2B exchanges. But by itself, said Draper, it can't solve data integration
problems because it has no implicit semantics -- it does not define what tags
mean.
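
A minimal sketch of that point, using two made-up XML fragments. Both are well-formed and describe the same customer, but nothing in XML itself says that one partner's tag means the same thing as the other's. The tag names and values here are hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Two hypothetical partners describe the same customer with different tags.
doc_a = "<customer><name>John L. Smith</name></customer>"
doc_b = "<account><client>Smith, John</client></account>"

root_a = ET.fromstring(doc_a)
root_b = ET.fromstring(doc_b)

# Both documents parse cleanly, so XML syntax is not the obstacle.
name_in_a = root_a.findtext("name")

# Asking the second document for the same tag finds nothing: the parser
# has no way to know that <client> carries the same kind of information.
name_in_b = root_b.findtext("name")

print(name_in_a)  # John L. Smith
print(name_in_b)  # None
```

The meaning has to come from somewhere outside the markup, such as an agreed industry vocabulary or an explicit mapping between the two tag sets.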
Standards will help, she said. Some industries, such as financial services,
are proposing their own XML-based standards so that when you communicate with
a business partner, you can reasonably predict what you will see. Even with such
standards, data integration will remain an issue, said Draper.
The emerging technology of dynamic data integration should defuse the problem further.
"What's required to make data integration work for Web services, or even
for the modern generation of rapid application development, is a strong
metadata-driven approach -- one that defines what the different data sources are
and what data transformations you need to do," she explained.
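
One way to picture the approach Draper describes is a table of metadata that names each source, the fields it supplies, and the transformation each field needs, with a single generic routine that reads the table. The sketch below is only an illustration under assumed names: the sources `crm` and `billing`, their field names, and the `integrate` function are all hypothetical and do not describe any vendor's product.

```python
# Hypothetical metadata: for each source, which field feeds each target
# field and which transformation to apply on the way.
SOURCES = {
    "crm": {
        "customer_name": ("name", str.strip),
        "balance_usd": ("balance", float),
    },
    "billing": {
        "customer_name": ("client", str.strip),
        # This source is assumed to store amounts in cents.
        "balance_usd": ("amount_cents", lambda cents: int(cents) / 100),
    },
}


def integrate(source, record):
    """Map one raw record into the shared shape using only the metadata."""
    mapping = SOURCES[source]
    return {
        target: transform(record[field])
        for target, (field, transform) in mapping.items()
    }


crm_row = integrate("crm", {"name": " John L. Smith ", "balance": "12.50"})
billing_row = integrate("billing", {"client": "Smith, John", "amount_cents": "1250"})

print(crm_row)      # {'customer_name': 'John L. Smith', 'balance_usd': 12.5}
print(billing_row)  # {'customer_name': 'Smith, John', 'balance_usd': 12.5}
```

Adding a new source in this sketch means adding an entry to the metadata rather than writing new integration code. Note that the two customer names still come out in different forms, so the records cannot yet be matched to each other.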
Technologies that cut XML overhead are also on the horizon. "Productivity
in application development matters," explained Draper. "But you can't
ignore performance. As people start to use Web services and XML, they will develop
the technologies to compress XML data and make those services as efficient as
can be."
Despite all these advances, however, Draper said data integration will always
be a "fundamental problem." Neither XML nor Web services address the
problem of dirty data -- what you get when you try to mix data from multiple
sources without a clean identification key. For instance, is the "John
L. Smith" in one system the same person as another's "Smith, John"?
"How different people address people's names, business names or product
names is an issue that will never go away," said Draper. "Dirty data
will always be with us."
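
The name example above can be sketched in a few lines. This is a deliberately naive normalizer, not a real record-linkage method: the function `name_key` and its rules (last name before a comma, otherwise the final word; middle names and initials ignored) are assumptions made for illustration.

```python
import re


def name_key(raw):
    """Reduce a personal name to a rough (last, first) comparison key."""

    def letters(word):
        # Keep lowercase ASCII letters only, dropping periods and the like.
        return re.sub(r"[^a-z]", "", word)

    text = raw.strip().lower()
    if "," in text:
        # "Smith, John" style: the part before the comma is the last name.
        last, _, rest = text.partition(",")
        given = rest.split()
    else:
        # "John L. Smith" style: the final word is taken as the last name.
        words = text.split()
        if not words:
            return ("", "")
        last, given = words[-1], words[:-1]
    # Only the first given name is kept; middle names and initials are ignored.
    first = letters(given[0]) if given else ""
    return (letters(last), first)


print(name_key("John L. Smith"))  # ('smith', 'john')
print(name_key("Smith, John"))    # ('smith', 'john')
```

Matching keys only suggest that two records may refer to the same person. Two different John Smiths collapse to the same key, and a suffix such as "Jr." or a nickname defeats these rules entirely, which is consistent with Draper's view that the problem does not go away.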