In-Depth

Managing unstructured information

Most of the information companies generate (more than 80 percent, according to experts) won’t fit into the neat-and-tidy cells of a traditional relational database. Getting a handle on unstructured information has suddenly become a front-burner issue. Just about every company is coping with an explosion of documents, PowerPoint slides, spreadsheets, PDFs, JPEGs, and MPEGs, along with a constant stream of e-mail messages. Federal regulations, such as Sarbanes-Oxley and HIPAA, have made organizations responsible for these great, scattered piles of information, even though they can’t easily manage them. Meanwhile, the ability to exploit unstructured data has turned into a competitive differentiator.


“Unstructured content has always been seventh or eighth on the priority list,” says Andrew Warzecha, an analyst with META Group, Stamford, Conn. “It moved up when the [tech industry] bubble burst and people started looking at consolidating systems to cut costs. It shot to the top when regulatory compliance became an issue. The fact that the CEO and the board are now being held accountable for how they manage this information is the reason this is a board-level issue right now.”

Most industry watchers agree regulatory compliance has shone a spotlight on the state of unstructured information in the enterprise, and has loosened a few purse strings. However, says Toby Bell, director in Gartner’s Knowledge Workplace, no one should forget compliance is just the first act.

“The natural, early response is to establish better controls on all enterprise content as a response to compliance issues,” Bell says. “It’s a tactical and pragmatic approach, and companies need to do it, but it’s not strategic. Companies also need to begin thinking in a more holistic way about this issue. Most organizations consider compliance to be a cost, but they’re not necessarily seeing that if they comply better, they can compete better.”

What’s structured? What’s not?
Definitions are somewhat mutable in this space, but whether you’re talking about unstructured data, information, or content, you’re essentially referring to the stuff on your servers that does not explicitly specify how it’s organized, how it relates to other data, or how it should be used. The two big buckets are bitmap objects, which are inherently non-language based (things like image, video, or audio files), and textual objects, which are based on written language, such as e-mails, spreadsheets, and Microsoft Word docs.

META Group applies three categories here:
• Structured information, which fits into database tables
• Unstructured information, which doesn’t
• Semi-structured information, which covers XML documents and other self-describing data

The amount of unstructured and semi-structured information in the enterprise is growing rapidly, doubling every year, by some estimates. Commonly used office productivity suites, such as Microsoft Office and Lotus SmartSuite, generate an enormous amount. A large enterprise is likely to comprise thousands of users generating tens of millions of files. Some productivity suites also include their own smaller databases—Access or FoxPro, for example. With the exception of a few specialized industries, most IT organizations have tended to leave the management of this type of information in the hands of end users, or, at best, small-unit managers.

Although structured data can be managed with solutions that support querying and reporting against predetermined data types and understood relationships, unstructured information has no conceptual definition or data type definition.

The technology for extracting metadata (such as a name, image format, size, resolution, or creation date) embedded in unstructured files is available, and the tools that digitize such objects are getting better at adding metadata. Document image capture software from companies such as Kofax and Captiva uses so-called intelligent character recognition and zonal scanning techniques to extract additional metadata, such as chapter heads and specifically coded information, from paper documents as they are scanned. Vendors including Autonomy, Convera, and Webverse offer technologies for managing and integrating JPEGs, PDFs, and audio files. But the ability to extract meaning from these unstructured objects is still in its infancy.
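As a rough illustration of the simplest kind of metadata extraction, the sketch below reads only what the file system already knows about a file. It is not how any of the vendors named above work; the helper name `basic_metadata` and the field names are assumptions made for this example, and the file type is guessed from the extension rather than from the file's contents.

```python
import mimetypes
import os
from datetime import datetime, timezone


def basic_metadata(path):
    """Return file-system metadata for one file (hypothetical helper)."""
    info = os.stat(path)
    guessed_type, _encoding = mimetypes.guess_type(path)
    modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
    return {
        "name": os.path.basename(path),
        "size_bytes": info.st_size,
        # Guessed from the file extension only; the contents are not inspected.
        "guessed_type": guessed_type or "unknown",
        # Last-modified time; a true creation date is not portable across OSes.
        "modified": modified.isoformat(),
    }
```

Attributes such as image resolution or chapter heads live inside the file itself, so retrieving them would take format-specific parsing that this sketch does not attempt.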

“Right now, many organizations are just trying to get their arms around how big a problem this is for them,” says META’s Warzecha. “And to a large extent, they just don’t know. We’ve seen a wide range of companies over the past 12 months going through self-assessments trying to determine how much information in their organization actually is unstructured, what categories that information falls into, and of those categories, which ones they should be formally managing, semi-managing, and not managing at all.”
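A first pass at the kind of self-assessment Warzecha describes could be as simple as counting files by category. The sketch below is one hypothetical way to do that: the extension-to-category table is an assumption that loosely follows META Group's three categories, and the function name `tally_by_category` is invented for this example. Classifying by extension alone is crude, so the result is at best a starting estimate.

```python
import os
from collections import Counter

# Hypothetical mapping from file extension to category. A real assessment
# would need rules that reflect the organization's own systems and formats.
CATEGORY_BY_EXTENSION = {
    ".mdb": "structured",
    ".dbf": "structured",
    ".xml": "semi-structured",
    ".doc": "unstructured",
    ".ppt": "unstructured",
    ".pdf": "unstructured",
    ".jpg": "unstructured",
    ".mpg": "unstructured",
}


def tally_by_category(root):
    """Count files under root per category, based only on file extension."""
    counts = Counter()
    for _dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            extension = os.path.splitext(filename)[1].lower()
            category = CATEGORY_BY_EXTENSION.get(extension, "unclassified")
            counts[category] += 1
    return counts
```

Files with extensions missing from the table land in an "unclassified" bucket, which shows how much of the inventory the rules fail to cover.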

Complexity makes it worse
The fact that companies typically have a number of information systems in place adds to the challenge. A recent META survey of mid-sized to large enterprises found that about one-third of companies had five systems or fewer, another third had between five and 10, and the remaining third had 10 or more. In paper-intensive industries, such as financial services, insurance, and government, it’s common to find more than 20 of these systems within one organization, Warzecha says. He cites a worst-case example of a large multinational insurance company with more than 26 different content-handling systems.

“What’s interesting about that situation,” he says, “is that those systems were doing what they were supposed to. There really wasn’t a need to replace any of them for not meeting business requirements. The problem from an IT perspective was that they had to have 26 different administrators checking 26 different servers on a daily basis. They had to have developers that used 26 different API sets. They had to have support personnel that used 26 different environments.”

The widespread practice of allowing unstructured information to float around an organization unmanaged seems unwise on its face, and there are many examples of its problem-causing potential. E-mail, in particular, seems to be emerging as a sort of Achilles heel for companies without adequate controls.

“E-mail tends to be riskier than documents,” says Rich Buchheim, senior director of product management and enterprise content management strategy at Oracle. “People tend to be more casual about it, and it’s ubiquitous. Instant messaging is worse than e-mail. People are extremely casual about that, and the kinds of things that are actually being communicated through IM are pretty hair-raising.”

“There’s a reason the very first thing a prosecuting attorney goes after these days tends to be e-mail,” says Warzecha. “There’s a whole bunch of stuff that really shouldn’t be in there.”

A user’s story: spurred by a merger
When hard drive maker Western Digital acquired the assets of bankrupt Read-Rite last year, it got more than a warehouse full of hard-drive parts. Read-Rite’s document management system, which was expensive to maintain, would have to be merged with Western’s own aging systems, which it had been on the verge of retiring for months. The acquisition finally got the ball rolling.

Western uses separate document management groups at each of its facilities, including corporate headquarters in Lake Forest, Calif., its design facilities in San Jose, and its manufacturing facilities in Malaysia and Thailand. Those groups manage a range of files, including manufacturing procedures, assembly instructions for the factories, product instructions, and policies, in the form of text documents, PDF files, and CAD drawings. Each group uses its own content management systems, which include home-grown systems and systems purchased from outside vendors.

“A big issue for us was security,” says Srinivas Ramachandruni, senior programmer specialist at Western Digital. “These are critical documents, and the idea was to get them together in one place so we could apply security policies uniformly, and manage and control the security around them.”

Western also hoped to uniformly implement backup and recovery protocols and procedures. “If you have all of your critical content in one place, you can establish proper backup and recovery,” Ramachandruni says. “Otherwise, you look up one day and you realize that in one corner of the world you have some critical documents that are not properly backed up and hard to restore.”

Western started with a small proof-of-concept project built on three pieces: Oracle Files (a component of the Oracle Collaboration Suite), two Dell 2650 servers, and a Web-based front end. “We wanted to see whether the architecture was robust and whether it could handle this load,” Ramachandruni explains. “When that stabilizes, we will bring other people and other groups onto this single consolidated document management system.”

Western has centralized its Lake Forest, San Jose, and Malaysia documents. The new system is also allowing the company to get away from the tree structure, which Ramachandruni says isn’t very useful when it comes to maintaining and searching unstructured data. “We needed to be able to search based on the text that is stored inside the document,” he says. “No one from IT is required to maintain those documents, and no one needs to be trained on the document management system. The users can do simple searches and get to the documents they want very easily.”
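Searching on the text stored inside documents, rather than browsing a folder tree, is commonly built on an inverted index that maps each word to the documents containing it. The sketch below shows that general technique in miniature; it is not a description of how Oracle Files or Western Digital's system works, and the function names and sample documents are made up for illustration.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercase word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return index


def search(index, query):
    """Return the names of documents that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Made-up sample documents, for illustration only.
docs = {
    "assembly.txt": "Assembly instructions for the drive head",
    "policy.txt": "Security policy for manufacturing documents",
    "procedure.txt": "Manufacturing procedure for drive assembly",
}
index = build_index(docs)
print(sorted(search(index, "drive assembly")))
# prints ['assembly.txt', 'procedure.txt']
```

Because the lookup depends only on the words in each document, users can find a file without knowing where it sits in any folder hierarchy.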

Reining in the data
Beyond the potential for legal liability and other risks lies what may be a greater concern in the long term: most enterprises are not effectively using roughly 80 percent of the information they generate. “If you think about it in terms of a corporate performance framework,” Bell says, “the idea that most companies have real control over only 20 percent of the intellectual assets of their business, and are making business decisions based on that, is pretty scary. It doesn’t make sense to leave all this valuable content on desktops, file servers, and filing cabinets.”

How do you get a handle on the stuff with no handle? More important, from an enterprise perspective, how do you extract value from these enormous stores of unstructured information?

Before you do anything, says META’s Warzecha, start by answering some fundamental questions about your business:
• What are my core business processes?
• What information is at risk here?
• In what categories should the at-risk information be placed?
• What can I do in the short term to shore up these processes to minimize risk and make sure we’re not fined by a regulatory agency?

Then, make longer-term recommendations about what you can do to automate some of these processes and procedures.

“Short term, it’s about identifying what information is at risk, and making sure that the gaps in your processes are filled on a manual basis and documented,” he says. “Longer term, large organizations should have an initiative under way, led by the CTO, to understand the company’s ongoing risks surrounding their information, and to investigate what types of technologies can best be implemented to automate those same things.”

Ramachandruni’s advice to enterprises is to start small. “We did not want to try to do all of this stuff at one go,” he says. “We started small and we’re scaling up. And we started with one element of our unstructured data. We focused on getting the basic manufacturing documents into one place. Over time, we’ll get it all consolidated.”