News

Will dirty data always be with us?

IT departments are dealing with more data than ever, but it is locked up in a patchwork of disparate repositories: legacy systems, relational databases, data warehouses, Web pages, e-mail and the like.

"There is data all over the place," said Denise Draper, chief software architect at Nimble Technology Inc., a data integration solution provider. "Individual enterprises sometimes have thousands of places where they store data, and it is stored in systems that were not designed to work together."

With the onslaught of Web services -- the ability to run integrated applications across multiple platforms over the Internet -- data integration problems are going to get a lot deeper. "Web services make the problem of data integration worse because they're about connecting things across wider boundaries," noted Draper. "Now you have to integrate your data with the data of your customers, partners and suppliers."

The semi-structured nature of XML makes it easier for an organization to share data among applications. Most organizations that are tuned in to XML use it as an interchange layer to integrate different applications, or to construct portals or B2B exchanges. But XML by itself, said Draper, can't solve data integration problems because it has no implicit semantics -- it does not define what tags mean.
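Draper's point about missing semantics can be illustrated with a minimal sketch. The two documents, their tag names and the `leaf_values` helper below are hypothetical and not drawn from any real system. A parser reads both documents without complaint, but nothing in the XML itself says that `name` and `fullName` describe the same thing.

```python
import xml.etree.ElementTree as ET

# Two hypothetical documents describing the same customer with different tags.
doc_a = "<customer><name>John L. Smith</name></customer>"
doc_b = "<client><fullName>Smith, John</fullName></client>"

def leaf_values(xml_text):
    """Return {tag: text} for every element that has no child elements."""
    root = ET.fromstring(xml_text)
    return {el.tag: el.text for el in root.iter() if len(el) == 0}

fields_a = leaf_values(doc_a)
fields_b = leaf_values(doc_b)

# Both documents are well-formed, yet they share no tag names, so a
# program cannot tell from the markup alone that the fields correspond.
shared_tags = set(fields_a) & set(fields_b)
print(shared_tags)
```

The set of shared tags comes back empty. Someone, or some agreed standard, still has to state that the two fields mean the same thing.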

Standards will help, she said. Some industries, such as financial services, are proposing their own XML-based standards so that when you communicate with a business partner, you can reasonably predict what you will see. Even with such standards, data integration will remain an issue, said Draper.

The new technology of dynamic data integration will further defuse the problem. "What's required to make data integration work for Web services, or even for the modern generation of rapid application development, is a strong metadata-driven approach -- one that defines what the different data sources are and what data transformations you need to do," she explained.
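A rough sketch of what a metadata-driven approach might look like follows. It is an illustration only, not Nimble's product or any particular vendor's design. The source names (`crm`, `billing`), the field names and the `to_shared_schema` helper are all assumptions invented for the example. The idea is that the description of each source and its transformations lives in data rather than in hand-written glue code.

```python
# Hypothetical metadata: for each source, which of its fields feeds each
# shared field, and which transformation to apply on the way in.
SOURCE_METADATA = {
    "crm": {
        "customer_name": ("full_name", str.strip),
        "customer_id": ("CustID", int),
    },
    "billing": {
        "customer_name": ("name", lambda s: s.strip().title()),
        "customer_id": ("account_no", lambda s: int(s.lstrip("A"))),
    },
}

def to_shared_schema(source, record):
    """Translate one source record into the shared schema using the metadata."""
    mapping = SOURCE_METADATA[source]
    return {
        target: transform(record[field])
        for target, (field, transform) in mapping.items()
    }

crm_row = to_shared_schema("crm", {"full_name": " Ada Lovelace ", "CustID": "42"})
billing_row = to_shared_schema("billing", {"name": "ada lovelace", "account_no": "A0042"})
print(crm_row == billing_row)
```

Adding a third source in this sketch means adding another metadata entry; the translation function stays the same.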

Technologies that cut XML overhead are also on the horizon. "Productivity in application development matters," explained Draper. "But you can't ignore performance. As people start to use Web services and XML, they will develop the technologies to compress XML data and make those services as efficient as can be."
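Draper does not name a specific compression technique. As a stand-in, the sketch below uses general-purpose gzip from the Python standard library on a hypothetical, highly repetitive order document. It is meant only to show why verbose XML markup tends to compress well, not to represent the XML-specific technologies she anticipates.

```python
import gzip

def compress_xml(xml_text):
    """Encode the XML as UTF-8 and gzip it."""
    return gzip.compress(xml_text.encode("utf-8"))

def decompress_xml(payload):
    """Reverse compress_xml and return the original XML text."""
    return gzip.decompress(payload).decode("utf-8")

# Hypothetical payload: the same tags repeat for every order.
order = "<order><customer>John L. Smith</customer><total>10.00</total></order>"
sample = "<orders>" + order * 200 + "</orders>"

packed = compress_xml(sample)
print(len(sample.encode("utf-8")), "bytes before,", len(packed), "bytes after")
```

The savings come at the cost of CPU time on both ends, which is the productivity-versus-performance trade-off Draper describes.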

Despite all these advances, however, Draper said data integration will always be a "fundamental problem." Neither XML nor Web services addresses the problem of dirty data -- what you get when you try to mix data from multiple sources without a clean identification key. For instance, is the "John L. Smith" in one system the same person as another's "Smith, John"? "How different people address people's names, business names or product names is an issue that will never go away," said Draper. "Dirty data will always be with us."
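The "John L. Smith" example can be made concrete with a small sketch. The `normalize_name` and `possibly_same_person` helpers are hypothetical and assume simple Western-style names with no suffixes or titles. Normalizing names this way can flag candidate matches, but it cannot confirm them.

```python
import re

def normalize_name(raw):
    """Reduce a personal name to a lowercase (first, last) pair."""
    raw = raw.strip()
    if "," in raw:
        # "Smith, John" style: the last name comes before the comma.
        last, _, rest = raw.partition(",")
        given = rest.split()
    else:
        # "John L. Smith" style: the last word is taken as the last name.
        parts = raw.split()
        if not parts:
            return ("", "")
        last, given = parts[-1], parts[:-1]
    first = given[0] if given else ""

    def clean(s):
        # Drop punctuation and case differences.
        return re.sub(r"[^a-z]", "", s.lower())

    return (clean(first), clean(last))

def possibly_same_person(a, b):
    """True when two names normalize to the same (first, last) pair."""
    return normalize_name(a) == normalize_name(b)

print(possibly_same_person("John L. Smith", "Smith, John"))
```

Two different people named John Smith would also match here, which is why such data stays dirty without a clean identification key.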