In-Depth

ETL tools still seeking a niche

The first tools for populating data warehouses focused on moving data from relational databases. They provided GUIs for pulling data from an RDBMS (extracting it), massaging the data into a standard format (transforming it), and putting it in the warehouse or a data mart (loading it). Appropriately, these tools were called extract, transform and load (ETL) tools.
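
In code, the pattern reduces to three steps. Here is a minimal sketch in Python, with SQLite standing in for both the source RDBMS and the warehouse; the tables and columns are hypothetical, not drawn from any product in this story.

```python
# A minimal extract-transform-load sketch. The "orders" source table and
# "fact_orders" warehouse table are invented for illustration.
import sqlite3

def extract(source_conn):
    # Pull raw rows from the operational system.
    return source_conn.execute(
        "SELECT order_id, customer, amount, order_date FROM orders"
    ).fetchall()

def transform(rows):
    # Massage each row into the warehouse's standard format.
    for order_id, customer, amount, order_date in rows:
        yield (order_id, customer.strip().upper(), round(float(amount), 2), order_date)

def load(warehouse_conn, rows):
    # Put the standardized rows into the warehouse table.
    warehouse_conn.executemany(
        "INSERT INTO fact_orders (order_id, customer, amount, order_date) "
        "VALUES (?, ?, ?, ?)",
        rows,
    )
    warehouse_conn.commit()

if __name__ == "__main__":
    src = sqlite3.connect("operational.db")  # stand-in for the source RDBMS
    dwh = sqlite3.connect("warehouse.db")    # stand-in for the warehouse
    load(dwh, transform(extract(src)))
```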

But RDBMSs were not intended for analytic decision support. To improve corporate decision-making, ETL tools were soon being used to integrate data from other sources, including operational systems as well as proprietary and packaged apps. In no time, data warehouses were sitting in the middle of all the pieces that had to talk to each other in an enterprise -- legacy systems, ERP systems, custom applications and flat files, many of them incompatible -- and the business intelligence users who needed that information in a standard format to make decisions.

"In the ETL world, the focus has moved from data warehouse-centric to enterprise-centric," said Michael Cassettari, vice president of marketing at Ascential Software, a Westborough, Mass., maker of ETL and data integration software.

With companies looking for faster ways to make more information available to a greater number of employees for reporting and analysis, demand grew for ETL software that could perform data integration. But as ETL tools became more firmly entrenched in the market, other issues surrounding data integration appeared, such as poor data quality, huge volumes of data and the amount of time required for data transformation.

Getting dirty data in shape

When customer information is stored on a maxed-out database or comes in from an outdated data entry system, it is hard to get the full picture of the customer relationship. Many businesses need to scrub their data to purge bogus records, duplicates, and bad addresses and phone numbers.
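
The scrubbing itself is conceptually simple, even if doing it across millions of records is not. The following sketch, with invented records and validation rules, shows the basic filter-and-dedupe pattern:

```python
# A minimal data-scrubbing sketch: drop records with unusable contact details,
# then collapse duplicates. Record layout and rules are illustrative only.
import re

def is_valid(rec):
    # Reject records with a malformed phone number or an empty address.
    phone_ok = re.fullmatch(r"\d{3}-\d{3}-\d{4}", rec.get("phone", "") or "")
    return bool(phone_ok) and bool(rec.get("address", "").strip())

def scrub(records):
    seen = set()
    for rec in filter(is_valid, records):
        # Treat records with the same normalized name and address as duplicates.
        key = (rec["name"].strip().lower(), rec["address"].strip().lower())
        if key not in seen:
            seen.add(key)
            yield rec

clients = [
    {"name": "Ann Smith", "address": "12 Oak St", "phone": "555-123-4567"},
    {"name": "ann smith", "address": "12 Oak St ", "phone": "555-123-4567"},  # duplicate
    {"name": "Bob Jones", "address": "", "phone": "none"},                    # bogus
]
print(list(scrub(clients)))  # one clean Ann Smith record survives
```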

The Jenny Craig chain of weight loss centers headquartered in Carlsbad, Calif., owns some 450 centers across North America, and 100 in Australia and New Zealand. The corporation runs its J.D. Edwards financials and inventory processing packages off an AS/400, but the rest of the organization runs on a proprietary distributed DOS-based data-entry system called ICAN. The company uses the BusinessObjects reporting tool from Business Objects in San Jose, Calif.

Like many organizations, Jenny Craig wanted a more complete view of its customers. In May 2002, before an IT overhaul that called for its first data warehouse and a new POS/CRM system, the firm decided to clean the data from its current Microsoft SQL Server DBMS -- which had many duplicate entries among its 2.5 million client records -- as well as data from its ICAN systems, which did a poor job of controlling the data entered into the fields.

For the project, Jenny Craig chose Ascential Software's Enterprise Integration Suite, which consists of MetaRecon for data profiling, DataStage for ETL and meta data management, and Integrity XE for data quality.

At first, the firm was not sure it needed the entire suite. "But Ascential said we did, and they were right," said Sophia Ruiz, data warehouse manager at Jenny Craig. "Once we did the discovery on the data, we saw how bad it was."

Jenny Craig will use the suite to cleanse, standardize and transform the data from its DBMS and ICAN systems. "Once connected to our database, MetaRecon showed our data any way we wanted to see it," said Ruiz. "We need to look at standardizing our reports, but it doesn't help to standardize data if we know the data is wrong and if we see flaws driving bad business decisions."

Even though Jenny Craig was not bringing in all the tables of its database, DataStage was necessary because "we've inherited a Frankenstein kind of database," said Ruiz. "The DBMS started out in a logical manner, but as changes were made to ICAN, certain fields and tables were no longer used and IT staff grew afraid to delete anything. They just kept attaching pieces." MetaRecon showed Jenny Craig what it did not need to bring over, while DataStage showed it how to bring over only what it needed.
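
The kind of data profiling a tool like MetaRecon automates boils down to computing statistics that expose dead or degenerate fields. A toy sketch, with hypothetical data and columns:

```python
# Column profiling sketch: fill rates and distinct counts flag fields that
# were abandoned as the schema grew. Data and column names are invented.
def profile(rows, columns):
    stats = {}
    for col in columns:
        values = [r.get(col) for r in rows]
        filled = [v for v in values if v not in (None, "")]
        stats[col] = {
            "fill_rate": len(filled) / len(values) if values else 0.0,
            "distinct": len(set(filled)),
        }
    return stats

rows = [
    {"client_id": 1, "status": "A", "legacy_code": ""},
    {"client_id": 2, "status": "A", "legacy_code": ""},
    {"client_id": 3, "status": "B", "legacy_code": ""},
]
for col, s in profile(rows, ["client_id", "status", "legacy_code"]).items():
    # A column that is almost never filled is a candidate to leave behind.
    print(col, s)
```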

"Now we can transform the data to the format we need in the future; as files come in daily from the ICAN systems in the field, no work needs to be done to them," Ruiz said.

The Integrity component ensures data quality. Customizable to company business rules, Integrity provides control over names and addresses, as well as related data such as phone numbers and e-mail addresses -- and discovers relationships among them -- in batch and real-time. Ruiz uses this component to standardize names and addresses so the firm can postal-certify information for a mailing campaign, rather than send raw data to a third-party vendor for certification.
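
As a rough illustration of what rule-driven standardization looks like (the abbreviation table and casing rules below are invented, not Ascential's actual rules):

```python
# A sketch of rule-driven address standardization for postal certification.
STREET_ABBREVIATIONS = {"STREET": "ST", "AVENUE": "AVE", "ROAD": "RD"}

def standardize_address(raw):
    # Uppercase, collapse whitespace, and apply standard postal abbreviations.
    words = raw.upper().split()
    return " ".join(STREET_ABBREVIATIONS.get(w, w) for w in words)

print(standardize_address("12  oak Street"))  # -> "12 OAK ST"
```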

According to Jenny Craig CIO Sara Braziller, deploying the data warehouse before the POS/CRM system "has given us a chance to force users to talk about how they want to see the data and why." The next phase is to work with users and executives on creating business rules. "Nailing down the business rules is an important part of finishing this project and going to the POS system," said Braziller.

Morass of meta data

Another problem ETL tools have when it comes to data integration is that each one manages meta data differently. "One problem all organizations would like to solve is how to obtain a single meta data view of the entire enterprise," said Faisal Shah, co-founder and CTO at Knightsbridge Solutions, a systems integration and consulting firm in Chicago.

Companies are trying to remedy this by extending the meta data management capabilities of their software. For example, Business Objects enhanced the reporting data in its Data Integrator 6.0 package so that it can be integrated more easily with front-end business reporting and analysis tools. The updated meta data management capabilities also make it easier to update reports and restructure databases whenever applications are upgraded. Users can now look at a report, see a measure such as margin, and trace the source of that calculation.

"Now, not only do they know how the piece of data was derived but where it came from," said Denise Sawicki, senior product marketing manager at Business Objects.

As vendors scurry to add capabilities to their ETL products, the market seems to be bifurcating into high- and low-end solutions. According to Ted Friedman, a senior product analyst at Stamford, Conn.-based market research firm Gartner Inc., some 75% of data warehouse builders still custom-code their own ETL tools and processes. But why so many holdouts?

"ETL tools are expensive, people's environments are complex, and the tools don't always speak to the nuances of the customer's legacy environments," Friedman said. "Besides, in this economy, more people are taking the approach that if it ain't broke, don't fix it."

But custom coding is highly time-consuming. "There's a lot of places where data integration is custom-built," said Sanjay Poonen, vice president of worldwide marketing at Informatica Corp., Redwood City, Calif. "But companies like ours can help you do it faster, and 'build faster' has caught the attention of many CIOs."

Next-generation ETL tools, high-end and low-end alike, have also moved into the real-time data integration space. In June 2002, Informatica unveiled a real-time version of its PowerCenter data integration software. The crux of PowerCenterRT is a zero-latency engine that processes trickle-feed data, allowing real-time data flows generated by transactional systems and messaging infrastructures to be integrated with historical, time-slice information stored in data warehouses. The price tag: $200,000.
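
The trickle-feed idea can be pictured as applying events to the warehouse as they arrive rather than waiting for a nightly batch. A toy sketch, using an in-process queue and SQLite as stand-ins for the messaging infrastructure and the warehouse (this is an illustration of the concept, not PowerCenterRT's internals):

```python
# Trickle-feed sketch: low-latency events land in the warehouse immediately,
# where queries can join them with historical time-slice data.
import queue
import sqlite3

events = queue.Queue()           # stand-in for a messaging infrastructure
dwh = sqlite3.connect(":memory:")
dwh.execute("CREATE TABLE fact_trades (account TEXT, amount REAL, ts TEXT)")

def trickle_feed():
    # Drain whatever has arrived and apply it right away.
    while not events.empty():
        account, amount, ts = events.get()
        dwh.execute("INSERT INTO fact_trades VALUES (?, ?, ?)", (account, amount, ts))
    dwh.commit()

events.put(("A-100", 250.0, "2002-06-01T09:30:00"))
trickle_feed()
print(dwh.execute("SELECT * FROM fact_trades").fetchall())
```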

If you are a small business and $200,000 is beyond your budget, you can turn to Data Junction Corp. The firm claims the enhanced Integration Engine in its new Integration Architect software package will perform optimally in heavy-volume, real-time integrated environments. A direct price comparison with PowerCenterRT is difficult because of Data Junction's licensing model.

"We feel we offer the best value in the data integration space," said Greg Grosh, co-founder co-founder of the Austin, Texas firm.

Or if you already use the Oracle DBMS, you can purchase the Oracle Warehouse Builder tool for the bargain price of $5,000, "1/50th of some of the other selling prices out there," noted John Entenman, Oracle's vice president of business intelligence products.

In recent years, the critical data integration piece has come to encompass a broad stack of middleware, with workflow at the top end, then BPM and finally the messaging layers. One way to scrimp is to eliminate a layer.

"If money's tight and you have to shed something from that stack, you can lose the BPM," said Brian Baker, vice president of marketing at XML Global Information Systems, a vendor of XML middleware solutions in Vancouver, British Columbia. "But you can't lose the data integration. You can write some of the adapters yourself, but a lot of the heavy lifting is in the mapping and integration part."

All ETL vendors offer real-time capabilities, but Gartner Inc.'s Friedman suspects they may be getting ahead of the market. When Gartner recently asked 500 organizations implementing business intelligence and data warehousing about their true latency requirements, only 11% reported a genuine need for real-time data today.

Offering real-time features is not the same thing as giving the customer true real-time capabilities, said Oracle's Entenman. "What most vendors are offering is a business intelligence solution that had to do summarizations and analytics, and then aggregate the information. It takes time to do that."

Where real-time is necessary

However, real-time capability for its data warehouse was critical for ING U.S. Financials, which offers services such as retirement plans, employee benefits and mutual funds through many distribution channels, including a network of independent financial advisors. In the past few years, ING has jumped headfirst into the M&A waters and acquired 20 firms. These acquisitions left ING's IT leaders facing skyrocketing amounts of data locked in silos, many on proprietary legacy systems.

One year ago, the company decided to integrate these 20 firms. The task of integrating more than 3TB of data from more than 100 heterogeneous and legacy systems was further complicated by the need to move from a batch processing model to a near real-time model.

"It was like throwing 20 firms together to behave as one," said Ray Karrenbauer, CTO at ING Americas' Technology Management Office in Hartford, Conn. "We didn't want a band-aid solution -- an up-front portal collecting data from 20 downstream sources, where you'd still face the problem of back-office inefficiencies. We wanted a consolidation strategy." Accordingly, the Technology Management Office initiated a data warehousing project that went live in September.

The project focused on operational data delivery and analyzing the data for optimal business performance. The organization eventually opted for Informatica's PowerCenterRT data integration platform. Karrenbauer cited several reasons for selecting PowerCenter -- its ability to plumb into ING's legacy systems and pull that information into its relational structure, its ease of use, its handling of meta data, its speed and ability to handle large load volumes -- but said one key factor was its support for bi-directional real-time updates: the zero-latency engine in PowerCenterRT can continuously capture transactional data from EAI message queues and publish information from the data warehouse back through those queues to other enterprise apps and systems.
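
Conceptually, the bi-directional pattern is a pump between two queues. The sketch below uses in-process queues and an invented enrichment rule as stand-ins; it illustrates the flow, not PowerCenterRT's implementation.

```python
# Bi-directional integration sketch: consume transactions from an inbound
# EAI queue, apply them to the warehouse, publish enriched results outbound.
import queue

inbound = queue.Queue()   # stand-in for the EAI queue feeding the warehouse
outbound = queue.Queue()  # stand-in for the queue other enterprise apps consume
warehouse = []            # stand-in for warehouse storage

def pump_once():
    while not inbound.empty():
        txn = inbound.get()
        warehouse.append(txn)                         # capture into the warehouse
        enriched = {**txn, "lifetime_txns": len(warehouse)}
        outbound.put(enriched)                        # publish back out in real time

inbound.put({"customer": "C-42", "amount": 1000.0})
pump_once()
print(outbound.get())  # downstream apps see the enriched record immediately
```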

By being able to use multi-dimensional analysis tools in real-time, ING's service reps get a holistic view of the customer, enabling them to cross-sell and up-sell more effectively. And with more comprehensive real-time corporate risk management information at their fingertips, company executives can make better decisions.

Karrenbauer divides his challenges into three categories: people (making everyone from 20 different firms act as one); process (bringing about new disciplines so that employees do not fall back on doing things the traditional way because they need to get it done fast); and technology.

"The technological challenge is loading more and more data," he said. "We can do it widely, across multiple business units, or deeply, in subject areas, but either way, we need to balance the two." He is quick to add that the technology challenge is "the least of [his] concerns," and he attributes that to PowerCenter.

Real-time is also important to customers who need to do data integration via transformation on the fly. These real-time transformation products must support a wider range of structured data formats than flat files and databases.

Uniboard, Laval, Quebec, operates 10 plants in North America for manufacturing composite wood panels. Each of these plants ran BPCS manufacturing software from SSA Global Technologies. But in 1996, Uniboard's parent company, Kunz Group of Germany, purchased SAP R/3 ERP software and, in 2001, decided to standardize on R/3 corporate-wide. At that point, the challenge became integrating R/3 with the Oracle 9i database applications that ran in the North American plants. In addition, the GP production system used at each of Uniboard's manufacturing facilities, which also ran off Oracle, had to be integrated with R/3.

Uniboard planned to hand-code the required integrations until it saw that would be too time-consuming to meet its tight schedule. Because the firm needed to move data between Oracle apps and R/3 without disrupting business flow, adopting SAP at its plants was not an option.

"We had three months to be ready for an integration test in June, with a go-live date of October," said Andre Sauve, who joined Uniboard Canada as director of business information systems in March 2002. "The time constraint was a challenge for us," he added. Neither Oracle nor SAP had the integration solution Uniboard was looking for; the challenge was to find a data integration product that the integration team could deploy out of the box.

Uniboard had more than 50 IDOCs -- an SAP message format used for real-time propagation of data across applications -- that needed to be mapped to Oracle tables and vice versa. After hearing about XML Global Technologies, Uniboard did some proofs of concept, then implemented a customized version of GoXML. Using the graphical tools provided in GoXML Transform, Uniboard was able to add business logic into its data mapping. Tools from other vendors might have required a business process engine, multiple passes or, worst of all, custom code. With GoXML Transform, the mappings were completed in one pass without any custom code and without the need for an expensive BPM suite.

Today, when a production order is put into the SAP system, R/3 sends an IDOC to the IBM MQSeries message-oriented middleware contained in the hub in Uniboard's Laval plant. "With the IDOC, you have to get information from the data element and convert it to the Oracle tables," said the firm's Sauve. "We use GoXML to transform the data to insert a business rule into the transaction."

As soon as the IDOC is put into MQ, MQ passes it to GoXML to be transformed. Each time there is a transaction in the DBMS, Oracle replicates it and passes it to GoXML to transform the data back to an IDOC and then transfer it back to the SAP system.
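
The mapping step can be pictured as parsing the message and emitting a table row, with the business rule folded into the transformation itself. The sketch below invents a tiny IDOC-like payload and rule for illustration; real IDOCs and Uniboard's actual mappings are far richer.

```python
# Sketch: parse an IDOC-like XML message, apply a business rule, and emit a
# row shaped for an Oracle-style table. Layout, rule and table are invented.
import xml.etree.ElementTree as ET

idoc = """<IDOC>
  <E1EDK01><ORDER>4711</ORDER><QTY>120</QTY><PLANT>LAVAL</PLANT></E1EDK01>
</IDOC>"""

def idoc_to_row(doc):
    seg = ET.fromstring(doc).find("E1EDK01")
    qty = int(seg.findtext("QTY"))
    # Business rule folded into the mapping: flag rush orders over 100 units.
    return {
        "order_no": seg.findtext("ORDER"),
        "quantity": qty,
        "plant": seg.findtext("PLANT"),
        "rush_flag": "Y" if qty > 100 else "N",
    }

row = idoc_to_row(idoc)
print(row)  # maps one-for-one onto columns of a production-order table
```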

Sauve said the GoXML GUI was so easy to use that Uniboard had its business analysts, rather than its developers, create the data transformations, knowing that the IDOC output generated by GoXML Transform would be consumable directly by the SAP system at the Kunz head office. "It's easy to take a transaction and make a business rule," said Sauve.

Uniboard said its initial investment in GoXML Transform was less than $50,000. "We could have purchased an app server solution, but the cost would have been five to 10 times as much," said Sauve. Uniboard calculates that the integration provided by GoXML Transform will save the firm more than $200,000 in the next two years.

EAI and Web services

The latest category of data integration tools is enterprise application integration, or EAI. Although some EAI vendors do a bit of ETL, and some ETL vendors do a bit of EAI, most observers seem to think that each will persist, with ETL handling the transformation of data and EAI filling the need for a transactional system that handles workflow.

"EAI is an issue for us because we have so many legacy systems to tie together from an application perspective," said ING's Karrenbauer. "ETL transforms the moving data, but EAI is our first entry into BPM and business activity monitoring, or BAM, to manage the workflows."

Then there is Web services. A standards-based approach to B2B connectivity, EAI and other enterprise business and data integration applications, Web services is poised to deliver benefits in rapid deployment, scalability, flexibility and reuse.

"As the world shifts to a broader enterprise view, embracing technologies like Web services becomes critical," said Ascential's Cassettari. And count Sawicki at Business Objects as a Web services booster. Web services will make computing ubiquitous across all nodes, she said, adding, "Web services is the HTTP of tomorrow."

But while Web services has great potential for simplifying data integration, it will not eliminate the need for integration in general or APIs in particular. Nor will it let everyone talk to everything, since Web services lack the transaction capabilities of the current crop of application servers and databases, as well as the rock-solid messaging of many middleware products.

Web services typically expose data or accept transactions. In either case, you need to go from the Web service to the application system containing the data. When packaged application providers start building SOAP interfaces, that will go a long way toward making Web services the preferred data integration technology. Data Junction is working on a SOAP adapter it hopes will deliver on the plug-and-play promise of Web services.
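
At the wire level, calling a SOAP service requires nothing more exotic than posting an XML envelope over HTTP. A minimal sketch, with a hypothetical endpoint, namespace and operation:

```python
# Sketch of a raw SOAP 1.1 call using only the standard library. The
# endpoint, namespace and GetCustomer operation are invented for illustration.
import urllib.request

ENDPOINT = "http://example.com/services/CustomerLookup"  # hypothetical

envelope = """<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <GetCustomer xmlns="http://example.com/ns">
      <CustomerId>C-42</CustomerId>
    </GetCustomer>
  </soap:Body>
</soap:Envelope>"""

request = urllib.request.Request(
    ENDPOINT,
    data=envelope.encode("utf-8"),
    headers={
        "Content-Type": "text/xml; charset=utf-8",
        "SOAPAction": "http://example.com/ns/GetCustomer",
    },
)
# response = urllib.request.urlopen(request)  # returns a SOAP envelope to parse
```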

But when you add Web services into the picture, how things will play out is anyone's guess. Knightsbridge Solutions' Shah predicts that EAI will essentially serve as a plumbing vehicle for Web services. One thing is for sure, he said. ETL products are safe, because "the need for bulk work is not being mitigated by the need for Web services and EAI."

Gartner Inc.'s Friedman has concerns about Web services. A self-admitted "data guy," he said, "My fear is that Web services will hide a lot of the critical problems in data architectures. This concept that you can wrap your applications or databases in nice callable interfaces that will make your legacy stuff more accessible will cause people to avoid issues of data architecture and data quality."

Using Web services will mean a larger presence of XML data, said Baker at XML Global. "Companies are being compelled to move to XML by their partners. Wal-Mart is now only going to accept POS and invoices in XML format, so we have to map everything we have. That is the driving force behind the XML integration effort. There are still an enormous amount of databases and legacy systems that will never be in native XML format and will always require mapping."

Uniboard has started to archive transactions in XML format so it can perform analysis down the road. It still has to build a tool on top of GoXML to query that archive. "That will be in the future," said Sauve.

Please read the associated story "Real-time requirements drive ETL-EAI convergence" by Jack Vaughan.