In-Depth

Getting control of data

Do you ever feel lost in a whirlpool of information when trying to search for data on a specific topic? Most search engines have no trouble locating information, but they often come back with so much information that it takes extensive time and effort to wade through it.

Part of the problem is that these search engines are unintelligent: they cannot distinguish among the various meanings of a word, so they return every reference to it, regardless of its meaning. Unfortunately, this leaves many, if not most, users frustrated by the sheer volume of results. Users need a fairly clear idea of what they are looking for before a search engine can return appropriate results. But what if you are unsure of how to word your search? How can you narrow your search down to the results you want?

While there are various content management apps on the market, a new breed of content management-enabled apps is quickly rising to the forefront as a quicker, easier way to get to the information you want. Very simply, a content management-enabled app is software that, through categorization or text mining, manages the content it provides. It is like a search engine, only better and more robust.

Henry Morris, group vice president for applications and information access at IDC, Framingham, Mass., defines these content management-enabled apps as those that can leverage structure within documents. They are apps that look for information within and across documents, and then present all of their findings. ''They're using content below the level of just pointing you to a document or something [else],'' Morris noted.

Using categorization or text mining, these apps extract or identify concepts and meanings within documents. ''They're extracting concepts and relationships out of documents and setting up a presentation layer so you can see it in one view,'' he said. They are able to look at content at a granular level.

Because most search engines and information systems leave users frustrated with their inability to interact with information, Cambridge, Mass.-based Endeca Technologies Inc. set out to solve the problem of information access in late 1999. The company believes it is solving the information access problem through its Guided Navigation technology, which relies on the Endeca Navigation Engine.

''We're trying to unlock the value of enterprise and catalog stores to present users with a new approach to finding information that's already within their enterprise,'' noted CTO David Gourley.

Endeca's Navigation Engine functions as a data access layer; it takes information from relational databases, the Internet or e-commerce sites and presents it to users in such a way that they do not have to know the right question to get results. ''With Endeca, you start with results and narrow it down,'' Gourley explained. ''You have a dialog with the information that's stored.'' Through its Navigation Engine, Endeca is tying the search and navigation encounters together into one experience for its users.
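The idea of starting with results and narrowing them down can be illustrated with a small sketch. This is not Endeca's code or API; the records, field names and the `refine` and `available_refinements` functions are all hypothetical, and the approach shown is ordinary attribute filtering with counts.

```python
from collections import Counter

# Hypothetical catalog records; titles, fields and values are invented for illustration.
RECORDS = [
    {"title": "Fastener torque standard", "industry": "automotive", "type": "standard"},
    {"title": "Brake line data sheet", "industry": "automotive", "type": "data sheet"},
    {"title": "Airworthiness directive summary", "industry": "aviation", "type": "standard"},
    {"title": "Rivet supplier catalog", "industry": "aviation", "type": "catalog"},
]


def refine(records, selections):
    """Keep only the records that match every selected attribute value."""
    return [r for r in records if all(r.get(k) == v for k, v in selections.items())]


def available_refinements(records, selections, fields=("industry", "type")):
    """Count the values still present in the current result set, per unselected field."""
    current = refine(records, selections)
    return {
        field: Counter(r[field] for r in current)
        for field in fields
        if field not in selections
    }


# Start with everything, then narrow down one choice at a time.
selections = {}
print(available_refinements(RECORDS, selections))
selections["industry"] = "aviation"
print(available_refinements(RECORDS, selections))
print([r["title"] for r in refine(RECORDS, selections)])
```

Because the choices offered are counted from the current result set, every choice in this sketch leads to at least one record, so the user never has to guess at a query that might return nothing.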

As a provider of technical standards, specifications, logistics and parts information to corporations, Englewood, Colo.-based IHS Engineering was looking for a way to deliver content reliably in a 24x7 environment when it turned to Endeca. IHS wanted to standardize on one platform with the performance, scalability and flexibility to deliver all of its content, and to reduce the number of dependencies it had on various vendors, said Paul Magin, vice president of technology and product development. ''We had acquired so much content across so many types of disciplines, and [we] had a half-dozen search technologies,'' he said.

IHS has five applications running on five pipelines to the Endeca Navigation Engine. Its Aviation product provides air-worthiness information, while its Automotive Standards information app contains information from manufacturers specific to the automotive industry. The CatalogXpress app is a massive database of supplier content that ranges from catalogs to data sheets and specifications, and the Text Advantage app is a full-text search engine that searches all significant industry standard collections. The Techconnect client plug-in can run natively on Windows or in a browser, and analyzes the content in focus on a user's monitor. ''It looks at what you're looking at and determines if anything you're looking at can be found in our content,'' Magin explained.

All of these apps are updated daily, except Text Advantage, which is updated every 90 days. In addition, the next phase of the project places another Endeca engine in front of these five. This sixth engine will work as a master engine that can locate references to specific items across all the apps. ''The user doesn't have to worry about where the data's located,'' noted Magin.
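One way to picture a master engine sitting in front of several applications is a lookup that asks each one in turn and labels every hit with its source. The sketch below is only an illustration under that assumption; the index contents, the part identifiers and the `master_lookup` function are invented, and nothing here reflects how IHS or Endeca actually built the sixth engine.

```python
# Hypothetical stand-ins for the five application indexes; all contents are invented.
APP_INDEXES = {
    "Aviation": {"AD-100": "Airworthiness note on part AD-100"},
    "Automotive Standards": {"BR-7": "Brake hose standard BR-7"},
    "CatalogXpress": {"AD-100": "Supplier catalog entry for AD-100"},
    "Text Advantage": {"BR-7": "Full text mentioning BR-7"},
    "Techconnect": {},
}


def master_lookup(term, indexes=APP_INDEXES):
    """Ask every application index for the term and label each hit with its source."""
    hits = []
    for app_name, index in indexes.items():
        if term in index:
            hits.append((app_name, index[term]))
    return hits


# The caller names the item, not the application that holds it.
for app_name, summary in master_lookup("AD-100"):
    print(f"{app_name}: {summary}")
```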

When it began its search for a solution to deliver content reliably, IHS was looking for an engine that supported various types of media, including the Internet. Because some of its customers do not have Internet access, however, the company was forced to consider engines that could navigate through both Internet and CD-ROM data. Said Magin: ''The solutions that would serve both were too costly in developer resources and [meant] having to bend the product one way or the other,'' meaning more toward Internet or more toward CD-ROM.

After evaluating firms and products for its next-generation search engine, IHS selected Endeca. ''We needed something slick, elegant, with not a lot of administrative burden and where results could be easily replicated,'' Magin said. In five to 10 days, Endeca was able to take IHS's data and turn it into a proof of concept. Endeca ''had better performance and better search functionality than some of our existing products because so much time was spent tuning the various back-end systems,'' he added.

Using Endeca's technology has resulted in an overall improved user experience, shorter time to market and increased simplicity for IHS. ''Performance is a tenfold increase for the same amount of traffic,'' Magin said, adding that the company had a lot more hardware before turning to Endeca. ''Because of the way we're applying Endeca refinement technology, we're able to take half-structured, half-unstructured content. It's difficult for a user to select an answer that doesn't produce a result. Before, the user had to evaluate the results.''

In addition, he said, ''Endeca makes it practical to do real-time refinement while you search. You're always in a forward motion while you're searching. It's lightning-fast under a heavy load.'' If you need to back up in your search, Magin said, you can do so without starting a new query.

Finding the trees in the forest

Ronen Feldman, co-founder, president and chief scientist at ClearForest Corp., New York City, was looking for a way to get to the bottom line of information without having to do a lot of unnecessary reading along the way when he developed a text-mining technique in the early 1990s. That technique of reading text within documents is what makes ClearForest's technology work. ClearForest focuses on unstructured content and markets content management-enabled products to aid end users in avoiding unnecessary reading.

''We're in the gold-digging business,'' explained Barak Pridor, CEO. The company's technology sifts through mounds of information to locate relevant nuggets. It goes a step further, however, by making those nuggets useful to its end users.

All of ClearForest's products use intelligent auto-tagging as their foundation. They go through each document and identify relevant entities, events and facts as determined by the end user. The products also utilize business intelligence to pinpoint relationships across documents. ClearResearch searches documents coming from multiple sources. ClearEvents monitors news feeds, tags relevant business events and attaches alerts about pertinent events. ClearSight provides a visual roadmap to navigate through data quickly and easily. ClearTags provides in-depth meta-tagging of textual data.

New York-based Thomson Financial was in the process of implementing ClearTags into its workflow at press time and expected to have it up and running by the end of the second quarter or the beginning of the third. Thomson was looking for a way to index data at a granular level to provide more value-added indexing and extraction technologies for its clients, said Steven Segenchuk, director of content management in the Boston office.

After undergoing a six- to nine-month evaluation process, Thomson narrowed its content management-enabled applications vendor candidates down from about seven firms to three for more detailed analysis. The firm chose ClearForest's technology over the others mainly because of its information extraction abilities. ClearForest's ClearTags has the ''ability to identify and pull events out of a document so they can say to a good degree of accuracy that there's a discussion about a merger in this document and these are the two companies merging. The same with acquisitions,'' Segenchuk explained. ''Nobody else had that to the level ClearForest did.''
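To make the idea of pulling a merger or acquisition event out of a document concrete, here is a deliberately simple sketch. It uses regular expressions, which is a stand-in technique chosen for brevity and not ClearForest's method; the patterns, the `extract_events` function and the company names in the sample text are all invented.

```python
import re

# Toy pattern for a company name: one or more capitalized words.
# A real extraction system would rely on far richer linguistic analysis.
COMPANY = r"([A-Z][\w&]*(?: [A-Z][\w&]*)*)"

PATTERNS = {
    "merger": re.compile(
        COMPANY + r" (?:will merge|to merge|merges|merged) with " + COMPANY
    ),
    "acquisition": re.compile(
        COMPANY + r" (?:will acquire|to acquire|acquires|acquired) " + COMPANY
    ),
}


def extract_events(text):
    """Return one record per matched event, with its type and the two companies."""
    events = []
    for event_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            events.append(
                {"type": event_type, "companies": (match.group(1), match.group(2))}
            )
    return events


sample = (
    "Analysts said Alpha Corp will merge with Beta Industries. "
    "Separately, Gamma Ltd acquired Delta Systems."
)
for event in extract_events(sample):
    print(event)
```

Patterns this rigid miss most real phrasing, which is one reason accurate event extraction was a differentiator among the vendors Thomson evaluated.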

Another example of a content management-enabled application is Strategic Legal Management from LexTech Inc., Auburn, Calif. Although attorneys may not be that technologically savvy, they still have a need for the information that technology can provide. Strategic Legal Management manages and extracts content from analytical applications, including legal bills and other legal documents.

Using both categorization and text mining, the product goes through text and assigns binary values to it, while simultaneously establishing patterns and applying business rules to that text. The software then runs the text against its taxonomy and assigns the contents a place in that taxonomy based on business rules and pattern recognition.
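The combination of binary values, pattern recognition, business rules and a taxonomy can be sketched roughly as follows. This is an assumption-laden illustration, not LexTech's software: the patterns, the rule order, the taxonomy node names and the `binary_features` and `categorize` functions are all hypothetical.

```python
import re

# Hypothetical patterns; each one yields a binary value (1 if found, 0 if not).
PATTERNS = {
    "mentions_motion": re.compile(r"\bmotion\b", re.IGNORECASE),
    "mentions_summary_judgment": re.compile(r"\bsummary judgment\b", re.IGNORECASE),
    "mentions_deposition": re.compile(r"\bdeposition\b", re.IGNORECASE),
    "mentions_research": re.compile(r"\bresearch\b", re.IGNORECASE),
}

# Hypothetical business rules, checked in order: the first rule whose
# required flags are all 1 assigns the text to its taxonomy node.
RULES = [
    (("mentions_motion", "mentions_summary_judgment"), "Motions/Summary Judgment"),
    (("mentions_motion",), "Motions/Other"),
    (("mentions_deposition",), "Discovery/Depositions"),
    (("mentions_research",), "Analysis/Legal Research"),
]


def binary_features(text):
    """Turn a line of billing text into a dict of 0/1 pattern flags."""
    return {name: int(bool(p.search(text))) for name, p in PATTERNS.items()}


def categorize(text):
    """Place the text in the taxonomy using the ordered rules above."""
    flags = binary_features(text)
    for required, node in RULES:
        if all(flags[name] for name in required):
            return node
    return "Uncategorized"


print(categorize("Draft motion for summary judgment"))
print(categorize("Prepare for deposition of witness"))
```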

LexTech uses proprietary software to convert a paper law firm invoice into an electronic file format called Ledes (which stands for legal electronic data exchange standard). Companies like Blue Shield of California send outside counsel invoices to LexTech to be converted into Ledes files and entered into their database to be viewed online. ''They also send Ledes files to us electronically so we can put them in our practice management system and access them on our own,'' said Seth Jacobs, senior VP and general counsel for Blue Shield of California in San Francisco.

Blue Shield turned to LexTech when law firms it was dealing with were unable to produce invoices in electronic format. Blue Shield of California relies on LexTech's technology to notice data trends in its legal financial information.

''We're seeking to identify a means of comparing lawyer performance to other lawyer performance,'' Jacobs explained. ''When the financial information is converted and put into a database form, we can report off the database just like any other data to determine its metrics for performance. We can create a report that would show us, for example, the average cost for a motion for summary judgment by case type.''
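The report Jacobs describes is, at bottom, a grouped average. A minimal sketch follows, assuming each row holds the total billed for one task on one matter; the rows, the field names and the `average_cost_by_case_type` function are invented and are not taken from the Ledes format or from Blue Shield's system.

```python
from collections import defaultdict

# Invented rows; each is assumed to be the total billed for one task on one matter.
LINE_ITEMS = [
    {"case_type": "employment", "task": "motion for summary judgment", "amount": 12000.0},
    {"case_type": "employment", "task": "motion for summary judgment", "amount": 8000.0},
    {"case_type": "contract", "task": "motion for summary judgment", "amount": 15000.0},
    {"case_type": "contract", "task": "deposition", "amount": 4000.0},
]


def average_cost_by_case_type(items, task):
    """Average the amounts for one task, grouped by case type."""
    totals = defaultdict(lambda: [0.0, 0])  # case type -> [running sum, row count]
    for item in items:
        if item["task"] == task:
            bucket = totals[item["case_type"]]
            bucket[0] += item["amount"]
            bucket[1] += 1
    return {case_type: total / count for case_type, (total, count) in totals.items()}


report = average_cost_by_case_type(LINE_ITEMS, "motion for summary judgment")
for case_type, average in report.items():
    print(f"{case_type}: {average:.2f}")
```

Once invoices are in database form, the same grouping could be keyed by firm or time period instead of case type without changing the approach.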

The biggest benefit to Blue Shield is the ability to look at macro data across many different firms, time periods and cases to identify trends. Before engaging LexTech, Blue Shield of California had to do this type of task manually, if it did it at all. Because a legal invoice can be 10, 20 or even 100 pages long, ''the only way to track this kind of data would be to go through hundreds of pages of legal invoices, put information into an Excel spreadsheet and manually grind it out,'' Jacobs said. ''It was not really feasible to do this manually.''

Strategic Legal Management has let Blue Shield of California ''look at data in ways we've never been able to see it before so we can better manage our outside counsel expense,'' Jacobs said. And there is no manual intervention required. ''The only time it takes is the time to design the report, query the system and have the report show up on the computer,'' he said.

A handful of other firms are offering similar technologies. They include Attensity Corp., Salt Lake City; Biz360 Inc., San Mateo, Calif.; and WhizBang Labs Inc., Provo, Utah. More mainstream content management providers are likely to follow because there is a definite need in the industry for this type of technology. IDC's Morris believes it will be another three to five years before we see widespread adoption, however.

Considerable growth still needs to take place in categorization and text mining before these applications become more widespread. In the meantime, Morris said, you can expect to see more specialized applications. The area of content management is continuing to grow and gain interest, as are verticalization and industry-specific applications. According to Morris, ''Winning applications will bring together content, collaboration and analysis in a vertical-specific process.''

For now, vendors will continue to find ways to make information in and across documents more useful to users. ''Today, in the age of infoglut, we have lots of content available within our enterprise repositories,'' said Geoffrey Bock, senior consultant/analyst at Patricia Seybold Group, Boston. ''But unless we have the capabilities to find just what we need, just when we need it, all of these stored information assets represent nothing more than bits on disks.'' So, a-mining we will go.