Data warehousing lets its hair down

Data warehousing has been in transition throughout 2001, as have the tools that support it and the business strategies it sustains. New sources of dynamic market information, including clickstream data from B2B and B2C Web sites, require operational and analytic tools that are nimbler and quicker than those that worked for brick-and-mortar firms a decade ago. While weekly or even monthly sales reports and analysis were adequate in the early days of data warehousing, most firms now operate in an environment where 24-hour-old data is considered obsolete.

This article looks at trends in data warehousing tools and issues, including logical integration and storage. We will also look at innovations in analytics designed to handle the dynamic data generated by B2B and B2C Web applications, which bring new levels of complexity to information management.

Where Web meets legacy
"It's gotten harder instead of easier," said Kay Hammer, president of Evolutionary Technologies International (ETI) Inc., Austin, Texas, a pioneer in data warehousing technology. "If you're using your data warehouse for Customer Relationship Management (CRM), you now need to get data from the clickstream logs—your customers who are operating online—so you can see what promotions are working. That's a whole new source of data that makes it more complicated."

In Hammer's view, integration of Web and legacy has become even more important as companies do business in the B2B and B2C world.

"In the warehouse world, you're usually taking data from your legacy system, transforming it and putting it in a relational system," Hammer said. "But in this new e-business arena, you're going to want to take a transaction that happens on the Web and you might need to update an IMS inventory system and a VSAM shipping system. You're going to have to take a transaction from the new world, do whatever data transformation you have to do, and update an application living in the old world. The flow of information is now in two directions. Data integration is not just about getting stuff out of the old world, cleaning it up and putting it in the new world. It's a matter of making the two talk together."

Because both enterprise integration and data warehousing applications can become time-consuming projects to build and maintain, one strategy for coping with the massive amounts of data generated in B2B and B2C applications is to simplify and focus.

"Our usual take on data warehousing is that it is part of an overall business intelligence platform," said Jake Freivald, a specialist in data warehousing formerly at Information Builders Inc. (IBI), New York City, who is now an executive at iWay Software, an IBI spin-off.

One of the trends Freivald sees in data warehousing or business intelligence applications, as well as in integration, is the attempt to get data into more usable chunks. Freivald sees the eXtensible Markup Language (XML) as a crucial tool for this, since its tags allow users to extract just the information they need for analysis. XML helps resolve semantic issues regarding what constitutes meaningful data. Once the information is honed to digestible chunks, it is easier for end users to make sense of it.
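To make the "usable chunks" idea concrete, here is a minimal Python sketch that pulls only the fields a sales analyst might want out of an XML order document and ignores the rest. The document layout, element names and the `extract_sales_facts` helper are all invented for illustration; they are assumptions, not a format used by IBI or iWay.

```python
import xml.etree.ElementTree as ET

# Hypothetical order document; every element and attribute name is an assumption.
ORDER_XML = """
<order id="1001">
  <customer>
    <name>Pat Smith</name>
    <region>Northeast</region>
  </customer>
  <item sku="BIKE-01" quantity="1" price="450.00"/>
  <item sku="HELMET-07" quantity="2" price="35.50"/>
  <shipping carrier="ground" notes="leave at door"/>
</order>
"""

def extract_sales_facts(xml_text):
    """Return one row per line item, keeping only analysis-relevant fields."""
    root = ET.fromstring(xml_text.strip())
    region = root.findtext("customer/region")
    return [
        {
            "order_id": root.get("id"),
            "region": region,
            "sku": item.get("sku"),
            "revenue": int(item.get("quantity")) * float(item.get("price")),
        }
        for item in root.findall("item")
    ]

facts = extract_sales_facts(ORDER_XML)
```

Because the tags name what each value means, the customer name and shipping notes can be skipped without any positional parsing, which is the small-subset approach Freivald describes.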

"Some of the semantic issues are less difficult when you only deal with a small subset of the problem," Freivald said. "In other words, you're not eating the whole elephant."

But filleting an elephant is not easy, and some IT professionals must feel like big game hunters abandoned in the jungle with a gigantic carcass and a penknife.

Tom Ebling, CEO at Torrent Systems, Cambridge, Mass., stated the problem of data warehousing management in the era of B2B and B2C this way: "At the same time their data is growing, the time frame to process the data—whether that means putting it into a data warehouse or running some analytics off it to determine how to deal with customers—is shrinking due to competitive pressures and the fact that companies are dealing with their own customers in more real-time venues than they used to."

In the business of sales, time has always been important, but in Web commerce it has become crucial.

"It used to be that if you did a mailing campaign, and it took you 30 days to figure out whom to mail to vs. 29 days, nobody sweated it too much," recalled Ebling. "But [when] someone shows up at your Web site and you want to figure out what to present to them, you've got to do it while they're there or you've missed your opportunity."

Finding the grand, unified theory of data
Further complicating everything, as Ebling sees it, is the integration problem that ETI's Hammer also sees as a key issue. For a brick-and-mortar company expanding sales and services channels to Web sites, not all customer data is coming in via the Internet. Some information is still gathered the old-fashioned way. For example, a banking customer might do some online transactions but also send materials via snail mail, make voice calls to banking clerks and sometimes go to the branch in person. Is all this data collected uniformly? Does it all reside in the same database system? Will it all end up in the same data warehouse?

That is not very likely. Online and traditional brick-and-mortar channels create separate but equally difficult problems.

"What you see today, and this is true for almost all companies," said Ebling, "is that the application of their analytics and business intelligence in the two different channels is completely separate. There's no connection. They don't use the information from offline behavior to affect the way they deal with their customers online and visa versa. The reason is that it's very hard. The volumes of data are so large, pulling them together and analyzing them quickly enough is a very difficult task."

For Hammer, Ebling and other vendors of data warehousing tools, the search is now on for a grand, unified theory that will bring together information from online and traditional business transactions. This is the operational and management tool problem that must be solved before all the fancy analytical tools can go to work on the data.

If anything, the failure of so many pure companies has spotlighted this integration problem. Web-only ventures like—now up in doggy heaven—did not have this integration problem. They started off fresh with one database system collecting information from one Web site. This was data warehousing made easy. But since most of the survivors in B2C are established store chains, we now have data warehousing made difficult.

"Most of the companies with the biggest presence on the Internet in the next two years will also be companies that have a big presence outside the Internet," predicted Ebling. "And they're going to want to capture that [online] data to use it, not just in real-time context, but also in the context of the rest of their business.

"We've got a tool that takes clickstream data and converts it into visitor session information that can be stored in a data warehouse and used for future analysis," he added. "We're seeing a lot of interest in that product for that reason."

Of course, Ebling is not the only person looking for a way to integrate clickstream data into a larger data warehousing context.
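Turning raw clickstream hits into visitor sessions can be sketched in a few lines of Python. This is not Torrent's product logic, which the article does not describe; it is a common approach that assumes each hit carries a visitor ID, a timestamp in seconds and a URL, and that 30 minutes of inactivity ends a session. The `sessionize` function and the timeout value are assumptions made for illustration.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # assumed: 30 minutes of inactivity ends a session

def sessionize(hits, timeout=SESSION_TIMEOUT):
    """Group (visitor_id, timestamp, url) hits into per-visitor sessions."""
    by_visitor = defaultdict(list)
    for visitor_id, timestamp, url in hits:
        by_visitor[visitor_id].append((timestamp, url))

    sessions = []
    for visitor_id, visits in by_visitor.items():
        visits.sort()  # order each visitor's hits by time
        current = None
        for timestamp, url in visits:
            # Start a new session on the first hit or after a long gap.
            if current is None or timestamp - current["end"] > timeout:
                current = {"visitor": visitor_id, "start": timestamp,
                           "end": timestamp, "pages": []}
                sessions.append(current)
            current["end"] = timestamp
            current["pages"].append(url)
    return sessions

hits = [
    ("v1", 0, "/home"),
    ("v1", 120, "/bikes"),
    ("v2", 60, "/home"),
    ("v1", 4000, "/home"),  # more than 30 minutes later: a second session
]
sessions = sessionize(hits)
```

Each resulting session is a compact row (visitor, start, end, pages) that could be loaded into a warehouse table in place of the far larger raw log.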

Is the tool a problem or a solution?
One of ETI's customers is a major East Coast insurance company. The company decided to use ETI's tool for clickstream extraction after a homemade tool was swamped by the complexity of Web application dynamics.

"The insurance company originally wrote its own clickstream extraction programs to populate its data warehouse," ETI's Hammer explained. "And then when they went to change their Web-based application and page design on the Web, their clickstream programs didn't work anymore."

When it comes to operational and analytical tools for data warehousing, the dynamism of Web commerce becomes as much a problem as a solution.

"The reason you like the Web is that it gives you a fast way to get a new market message to a whole bunch of people," explained Hammer. "But if you're going to figure out the effectiveness of that new promotion, then you've got to be able to capture the data. You need to have the methodology and tools that will help you. As you redesign that Web page or put that new promotion out there, [the tool] lets you get the data you need. How many people go to this particular product, but exit before buying it? If you have a huge number of people going to a particular page for a particular product and they don't buy, then that says something about what your competitors might be offering in that space. Customers might do a little comparison shopping and you don't come out very well. It's that kind of stuff that people want to know by looking at Web behavior and patterns."

Getting the right messaging software
Tools to integrate Web and enterprise systems into one data warehouse are the challenge of the new century.

"I have no doubt at all that you will see significant integration of data warehousing with B2B, e-commerce and similar kinds of systems," said iWay's Freivald, who has been working on this problem since he was "the data warehouse guy at Information Builders."

In Freivald's view, this integration will mean taking real-time data from a B2B transaction and running it through a message broker, which will send updates to the order processing system, accounting system and any of the other back-office systems that may be involved.

But one of the steps in that message flow will be, "update the data warehouse with the latest information," he said.

As Freivald explained: "That B2B transaction isn't necessarily captured in any single system. It might have information that's already reconciled semantically with what the data warehouse holds, and you want to put that directly in there instead of having to re-glean it from the operations systems. So, you'll see information flow directly into the data warehouse and that will provide a little bit more real-time capability on the data warehouse, too."
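The message flow Freivald describes can be illustrated with a toy publish/subscribe broker in Python. This is an in-process sketch, not iWay's software or any real messaging product; the `MessageBroker` class, the topic name and the message fields are assumptions. The point is only that the data warehouse update is one more subscriber to the same transaction.

```python
class MessageBroker:
    """Minimal in-process publish/subscribe broker (illustrative only)."""

    def __init__(self):
        self._subscribers = {}

    def subscribe(self, topic, handler):
        self._subscribers.setdefault(topic, []).append(handler)

    def publish(self, topic, message):
        # Deliver the same message to every system registered for the topic.
        for handler in self._subscribers.get(topic, []):
            handler(message)

# Lists stand in for the order processing system, accounting system and warehouse.
orders, ledger, warehouse = [], [], []

broker = MessageBroker()
broker.subscribe("b2b.order", lambda m: orders.append(m["order_id"]))
broker.subscribe("b2b.order", lambda m: ledger.append(m["quantity"] * m["unit_price"]))
broker.subscribe(
    "b2b.order",
    lambda m: warehouse.append(
        {"sku": m["sku"], "revenue": m["quantity"] * m["unit_price"]}
    ),
)

broker.publish(
    "b2b.order",
    {"order_id": "PO-1", "sku": "BIKE-01", "quantity": 3, "unit_price": 400.0},
)
```

The warehouse row arrives as part of the transaction's own message flow, so nothing has to be re-gleaned later from the operational systems.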

The biggest change brought about by B2C—where data may cover millions of customers—and to a lesser extent B2B—where information may involve hundreds or thousands of trading partners and vendors—is the movement from physical data warehouses to purely logical integration of information.

"B2B and B2C transactions, that sort of thing is all a hybrid of logical and physical integration," said Freivald. "Sometimes the integration is purely logical. For example, if gets information from its local database about a book title and goes back to my account on the shipping system to see when it was shipped, that's logical integration. It never gets written on a database anywhere, but it's written on the screen for me.

"But the data warehouse is physical integration, where you just move the data from one place to another and aggregate it that way," he added. "As B2B and B2C transactions continue to grow, we're going to see a continued need to semantically tie those things together, to make sure we've got a sheet of music we can sing from."

Technology for logical integration will represent a growing market in the new decade, according to Freivald.

"You're going to see incredible growth in the logical integration space, where you're tying together pieces of meta data from one organization to another, and where you're tying together transactions and enriching an XML document with information from other places," he explained. "But you are also going to see a continuing need to say, 'Well, I just did all this stuff to my B2B site, what did that mean to my bottom line?' To answer that question, we have to do the semantic integration."

Trends in analytic tools
Once the operational and integration issues are resolved (assuming they will be, more or less, at some point), the fun tools for data warehousing can be employed. Increasingly sophisticated analytic tools designed for the business end user can slice and dice clickstream data like there is no tomorrow. And given the need for real-time, up-to-the-minute data analysis, there really is no tomorrow.

A new breed of analytic tools not only finds obvious data matches, such as how many customers bought bicycles today, but provides the sales and marketing departments with new insights into cycling enthusiasts' buying habits.

digiMine Inc., Seattle, which provides data warehousing services on an ASP model, has developed data mining tools that allow for the segmentation of users based on their behavior. Its data mining algorithm can process Web data, including clickstream and online transaction data. The algorithm looks at the different types of users and their behavior, and then segments them to identify interests and buying habits.

"The reason user segmentation is so critical," explained Bassel Ojjeh, COO at digiMine, "is it can find patterns for you that you might not identify manually."

The segmenting algorithm is free of preconceptions about customers and identifies patterns that a marketer using a manual approach might miss. Segments are presented to the business user via a Web browser in a graphical chart showing "buckets" of information. The users, usually sales and marketing managers and executives, can drill down on an individual bucket and view a display of the four or five key attributes of that segment.

"It could be a completely new discovery for the business user," said Ojjeh. "For example, an outdoor equipment retailer with different departments feeds clickstream and transaction data into this algorithm and it will identify 10 buckets of users. The business user can then look at those buckets to find out what is unique about those customers. One segment of customers might be spending all their time on cycling. They come in via the cycling shortcut and, on average, they spend $200. The business user can take all those customers in the bucket and label them 'cycling enthusiasts' and the next time they come into the site they can target them with promotions for a new bike or cycling clothes and increase the chance of making a sale."

Such a system is not foolproof, as anyone who has ever relied on a computer to sort information will realize. Ojjeh allows that his company's tool will sometimes come up with a segment based entirely on random purchasing patterns that would be useless to even the most imaginative marketing maven.

"But there are some other segments," he argued, "probably about 70% of the segments, that it will discover for you. And you will say, 'Okay, this is actually a very meaningful segment. I can do something with this.'"

Operational issues
While complex issues in integration and analytical algorithms may be hot topics in discussions of data warehousing tools, the basics of keeping the whole system operational are still key for IT departments.

"With data warehousing in general, the biggest challenge is the operational part of it. The implementation part takes the least amount of time, considering the life cycle of the data warehouse," said Ojjeh. "Designing the schemas, the data models, providing the data transformation services and transforming the data—that process can be done in about eight to 12 weeks. The biggest problem with data warehouses is the operational part: Keeping the data warehouse up-to-date, the schemas updated as new business dimensions are created and making sure the data is going into the warehouse on a regular basis. If that doesn't happen, data warehouses tend to become stale and business users will rely on them less because they do not have current data, up to a point where the data warehouse becomes obsolete."

With the new dynamism introduced through Web applications, the ability to quickly adapt to constantly changing information management demands is viewed as crucial to data warehousing success.

"One of the key things that will make or break people as they move forward is making sure the tools they choose and the methodology they use supports efficient change management," said ETI's Hammer.

She argued that data warehousing management tools need to capture an audit trail of all the interrelationships that come into play in all the data interfaces between Web and back-office systems.

"If your system is running in near-time and something changes, it's not acceptable to take two weeks to adapt your system," Hammer said. "So, being able to do impact analysis very quickly when something changes, and to respond to that change very quickly is going to be key."

Another crucial operational issue is that mundane problem of storage management. A system that collects clickstream data from millions of consumers is obviously going to generate a lot of bytes.

"What we've seen is that a company's data is growing at a very rapid rate," said Torrent Systems' Ebling. "That's been happening for a very long time, and the Internet just makes it worse because it just keeps accumulating more data."

The good news is that vendors are offering new technologies to handle the data deluge, including tools for storage management, analysis and online backup.

"With the ever-increasing volume of data and users accessing a data warehouse, it is critical to choose the right storage technology for the right data warehouse application," said digiMine's Ojjeh. "Storage vendors are adding new software and hardware technologies specifically designed for data warehousing systems."

A new era for data warehousing tools
The growth of B2B and B2C Web applications is bringing change to the information management arena, but the one thing it is not doing is making data warehousing obsolete. As long as sales managers need to know how their products are moving and executives need to identify trends in their industries, they will need data warehouses and the analytic tools that help them to understand how the business is doing.

The new challenge for IT departments is to employ the newest management and operational technologies to ensure that information, integrated from Web and back-office systems, is accurate and as close to real time as possible. Because in the new era of data warehousing, there is no tomorrow.