In-Depth
Getting control of data
- By Lana Gates
- July 1, 2002
Do you ever feel lost in a whirlpool of information when trying to search for
data on a specific topic? Most search engines have no trouble locating
information, but they often come back with so much information that it takes
extensive time and effort to wade through it.
Part of the problem with these search engines is that they are unintelligent
and unable to filter through various meanings of words, so they come back with
every reference to a particular word, regardless of its meaning. Unfortunately,
this leaves many, if not most, users frustrated at the myriad results. Users
have to have a pretty good idea of what they are looking for so the search
engine can provide them with the appropriate results. But what if you are unsure
of how to word your search? How can you narrow your search down to provide the
results you want?
While there are various content management apps on the market, a new breed of
content management-enabled apps is quickly rising to the forefront as a quicker,
easier way to get to the information you want. Very simply, a content
management-enabled app is software that, through categorization or text mining,
manages the content it provides. It is like a search engine, only better and
more robust.
Henry Morris, group vice president for applications and information access at
IDC, Framingham, Mass., defines these content management-enabled apps as those
that can leverage structure within documents. They are apps that look for
information within and across documents, and then present all of their findings.
''They're using content below the level of just pointing you to a document or
something [else],'' Morris noted.
Using categorization or text mining, these apps extract or identify concepts
and meanings within documents. ''They're extracting concepts and relationships
out of documents and setting up a presentation layer so you can see it in one
view,'' he said. They are able to look at content at a granular level.
Because most search engines and information systems leave users frustrated
with their inability to interact with information, Cambridge, Mass.-based Endeca
Technologies Inc. set out to solve the problem of information access in late
1999. The company believes it is solving the information access problem through
its Guided Navigation technology, which relies on the Endeca Navigation
Engine.
''We're trying to unlock the value of enterprise and catalog stores to
present users with a new approach to finding information that's already within
their enterprise,'' noted CTO David Gourley.
Endeca's Navigation Engine functions as a data access layer; it takes
information from relational databases, the Internet or e-commerce sites and
presents it to users in such a way that they do not have to know the right
question to get results. ''With Endeca, you start with results and narrow it
down,'' Gourley explained. ''You have a dialog with the information that's
stored.'' Through its Navigation Engine, Endeca is tying the search and
navigation encounters together into one experience for its users.
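The "start with results and narrow it down" idea can be pictured with a small sketch. It is not Endeca's implementation; the records, field names and helper functions below are hypothetical and only illustrate showing the user every remaining refinement, so that no choice leads to an empty result.

```python
from collections import Counter

# Hypothetical catalog records; titles and field names are illustrative only.
RECORDS = [
    {"title": "Fastener standard", "industry": "aviation", "kind": "standard"},
    {"title": "Fastener data sheet", "industry": "automotive", "kind": "data sheet"},
    {"title": "Brake standard", "industry": "automotive", "kind": "standard"},
    {"title": "Rivet catalog", "industry": "aviation", "kind": "catalog"},
]
FACETS = ("industry", "kind")


def refine(records, selections):
    """Keep only the records that match every selection made so far."""
    return [
        r for r in records
        if all(r[field] == value for field, value in selections.items())
    ]


def remaining_choices(records, selections):
    """For each unselected facet, count the values still present in the results."""
    return {
        field: dict(Counter(r[field] for r in records))
        for field in FACETS
        if field not in selections
    }


# Start with everything, then narrow down one choice at a time.
selections = {}
results = refine(RECORDS, selections)
choices = remaining_choices(results, selections)

selections["industry"] = "automotive"
results = refine(RECORDS, selections)
choices = remaining_choices(results, selections)
```

Because the choices are computed from the current result set, every option offered is guaranteed to return at least one record. Removing a key from `selections` and calling `refine` again widens the results without starting over.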
As a provider of technical standards, specifications, logistics and parts
information to corporations, Englewood, Colo.-based IHS Engineering was looking
for a way to deliver content reliably in a 24x7 environment when it turned to
Endeca. IHS wanted to standardize on one platform with the performance,
scalability and flexibility to deliver all of its content, and to reduce the
number of vendors it depended on, said Paul Magin,
vice president of technology and product development. ''We had acquired so much
content across so many types of disciplines, and [we] had a half-dozen search
technologies,'' he said.
IHS has five applications running on five pipelines to the Endeca Navigation
Engine. Its Aviation product provides air-worthiness information, while its
Automotive Standards information app contains information from manufacturers
specific to the automotive industry. The CatalogXpress app is a massive database
of supplier content that ranges from catalogs to data sheets and specifications,
and the Text Advantage app is a full-text search engine that searches all
significant industry standard collections. The Techconnect client plug-in can
run natively on Windows or in a browser, and analyzes content in focus on a
user's monitor. ''It looks at what you're looking at and determines if anything
you're looking at can be found in our content,'' Magin explained.
All of these apps are updated daily, except Text Advantage, which is updated
every 90 days. In addition, the next phase of the project places another Endeca
engine in front of these five. The sixth will work as a master engine that can
locate references to specific things in all the apps. ''The user doesn't have to
worry about where the data's located,'' noted Magin.
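A master engine of this kind can be thought of as one query fanned out across several indexes. The sketch below is an assumption about the general pattern, not a description of IHS's or Endeca's design; the sample documents and the `master_search` function are hypothetical.

```python
# Hypothetical per-application indexes, keyed by the application names in the
# article. The documents are invented placeholders.
INDEXES = {
    "Aviation": ["Fastener torque guidance", "Cabin wiring inspection notes"],
    "Automotive Standards": ["Brake fastener standard", "Paint finish standard"],
    "CatalogXpress": ["Supplier fastener catalog", "Relay data sheet"],
}


def master_search(term, indexes):
    """Search every application index and report where each hit was found."""
    needle = term.lower()
    hits = []
    for app, documents in indexes.items():
        for doc in documents:
            if needle in doc.lower():
                hits.append((app, doc))
    return hits


hits = master_search("fastener", INDEXES)
```

Each hit carries the name of the application it came from, so the caller never has to know in advance which index to ask.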
When it began its search for a solution to deliver content reliably, IHS was
looking for an engine that supported various types of media, including the
Internet. Because some of its customers do not have Internet access, however,
the company was forced to consider engines that could navigate through both
Internet and CD-ROM data. Said Magin: ''The solutions that would serve both were
too costly in developer resources and [meant] having to bend the product one way
or the other,'' meaning more toward Internet or more toward CD-ROM.
After evaluating firms and products for its next-generation search engine,
IHS selected Endeca. ''We needed something slick, elegant, with not a lot of
administrative burden and where results could be easily replicated,'' Magin
said. In five to 10 days, Endeca was able to take IHS's data and turn it into a
proof of concept. Endeca ''had better performance and better search
functionality than some of our existing products because so much time was spent
tuning the various back-end systems,'' he added.
Using Endeca's technology has resulted in an overall improved user
experience, shorter time to market and increased simplicity for IHS.
''Performance is a tenfold increase for the same amount of traffic,'' Magin
said, adding that the company had a lot more hardware before turning to Endeca.
''Because of the way we're applying Endeca refinement technology, we're able to
take half-structured, half-unstructured content. It's difficult for a user to
select an answer that doesn't produce a result. Before, the user had to evaluate
the results.''
In addition, he said, ''Endeca makes it practical to do real-time refinement
while you search. You're always in a forward motion while you're searching. It's
lightning-fast under a heavy load.'' If you need to back up in your search,
Magin said, you can do so without starting a new query.
Finding the trees in the forest
Ronen Feldman, co-founder, president and chief scientist at ClearForest Corp.,
New York City,
was looking for a way to get to the bottom line of information without having to
do a lot of unnecessary reading along the way when he developed a text-mining
technique in the early 1990s. That technique of reading text within documents is
what makes ClearForest's technology work. ClearForest focuses on unstructured
content and markets content management-enabled products to aid end users in
avoiding unnecessary reading.
''We're in the gold-digging business,'' explained Barak Pridor, CEO. The
company's technology sifts through mounds of information to locate relevant
nuggets. It goes a step further, however, by making those nuggets useful to its
end users.
All of ClearForest's products use intelligent auto-tagging as their
foundation. They go through each document and identify relevant entities, events
and facts as determined by the end user. The products also utilize business
intelligence to pinpoint relationships across documents. ClearResearch searches
documents coming from multiple sources. ClearEvents monitors news feeds, tags
relevant business events and attaches alerts about pertinent events. ClearSight
provides a visual roadmap to navigate through data quickly and easily. ClearTags
provides in-depth meta-tagging of textual data.
New York-based Thomson Financial was in the process of implementing ClearTags
into its workflow at press time and expected to have it up and running by the
end of the second quarter or the beginning of the third. Thomson was looking for
a way to index data at a granular level to provide more value-added indexing and
extraction technologies for its clients, said Steven Segenchuk, director of
content management in the Boston office.
After undergoing a six- to nine-month evaluation process, Thomson narrowed
its content management-enabled applications vendor candidates down from about
seven firms to three for more detailed analysis. The firm chose ClearForest's
technology over the others mainly because of its information extraction
abilities. ClearForest's ClearTags has the ''ability to identify and pull events
out of a document so they can say to a good degree of accuracy that there's a
discussion about a merger in this document and these are the two companies
merging. The same with acquisitions,'' Segenchuk explained. ''Nobody else had
that to the level ClearForest did.''
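To make event extraction concrete, here is a deliberately simple sketch that tags mergers and acquisitions with regular expressions. It is a toy stand-in, not ClearForest's technique; the patterns, the sample sentence and the company names are all invented for illustration.

```python
import re

# A run of capitalized words stands in for a company name (a crude assumption;
# a name that ends a sentence would pick up the trailing period).
COMPANY = r"([A-Z][\w&.]*(?: [A-Z][\w&.]*)*)"

PATTERNS = {
    "merger": re.compile(
        COMPANY + r" (?:will merge|to merge|merges|merged) with " + COMPANY
    ),
    "acquisition": re.compile(
        COMPANY + r" (?:will acquire|to acquire|acquires|acquired) " + COMPANY
    ),
}


def tag_events(text):
    """Return one record per detected event, naming the two companies involved."""
    events = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            events.append({"event": kind, "parties": (match.group(1), match.group(2))})
    return events


sample = (
    "Analysts said Acme Corp. will merge with Globex Inc. next year. "
    "Separately, Initech acquired Hooli last month."
)
events = tag_events(sample)
```

A production system would need far more than pattern matching to reach the "good degree of accuracy" Segenchuk describes, but the output has the same shape: an event type plus the entities taking part in it.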
Another example of a content management-enabled application is Strategic
Legal Management from LexTech Inc., Auburn, Calif. Although attorneys may not be
that technologically savvy, they still have a need for the information that
technology can provide. Strategic Legal Management manages and extracts content
from analytical applications, including legal bills and other legal documents.
Using both categorization and text mining, the product goes through text and
assigns binary values to it, while simultaneously establishing patterns and
applying business rules to that text. The software then runs the text against
its taxonomy and assigns the contents a place in that taxonomy based on business
rules and pattern recognition.
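One way to read that description is as a two-step process: turn each piece of text into yes/no flags, then let business rules map those flags to a spot in the taxonomy. The sketch below follows that reading. It is an assumption rather than LexTech's actual method, and the keywords, rules and category names are hypothetical.

```python
# Hypothetical keywords whose presence or absence becomes a 1 or 0.
KEYWORDS = ("summary judgment", "deposition", "research")

# Hypothetical business rules, checked in order: (test on the flags, taxonomy node).
RULES = [
    (lambda f: f["summary judgment"] == 1, "Motions/Summary Judgment"),
    (lambda f: f["deposition"] == 1, "Discovery/Depositions"),
    (lambda f: f["research"] == 1, "Research"),
]


def binary_features(text):
    """Assign a 1 or 0 to each keyword depending on whether it appears."""
    lowered = text.lower()
    return {keyword: int(keyword in lowered) for keyword in KEYWORDS}


def categorize(text):
    """Place a line of invoice text in the taxonomy using the first matching rule."""
    features = binary_features(text)
    for rule, category in RULES:
        if rule(features):
            return category
    return "Uncategorized"


category = categorize("Draft motion for summary judgment")
```

Ordering the rules gives a simple way to settle ties when a line of text matches more than one keyword.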
LexTech uses proprietary software to convert a paper law firm invoice into an
electronic file format called LEDES (Legal Electronic Data Exchange
Standard). Companies like Blue Shield of California send outside
counsel invoices to LexTech to be converted into LEDES files and entered into
their database to be viewed online. ''They also send LEDES files to us
electronically so we can put them in our practice management system and access
them on our own,'' said Seth Jacobs, senior VP and general counsel for Blue
Shield of California in San Francisco.
Blue Shield turned to LexTech when law firms it was dealing with were unable
to produce invoices in electronic format. Blue Shield of California relies on
LexTech's technology to notice data trends in its legal financial information.
''We're seeking to identify a means of comparing lawyer performance to other
lawyer performance,'' Jacobs explained. ''When the financial information is
converted and put into a database form, we can report off the database just like
any other data to determine its metrics for performance. We can create a report
that would show us, for example, the average cost for a motion for summary
judgment by case type.''
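Once invoice lines sit in a database, the report Jacobs describes is a straightforward group-and-average. The sketch below shows the idea with invented data; the field names, firms and amounts are hypothetical and do not come from Blue Shield or LexTech.

```python
from collections import defaultdict

# Hypothetical invoice line items after conversion to structured records.
LINE_ITEMS = [
    {"firm": "Firm A", "case_type": "employment",
     "task": "motion for summary judgment", "amount": 12000.0},
    {"firm": "Firm B", "case_type": "employment",
     "task": "motion for summary judgment", "amount": 8000.0},
    {"firm": "Firm A", "case_type": "contract",
     "task": "motion for summary judgment", "amount": 15000.0},
    {"firm": "Firm B", "case_type": "contract",
     "task": "deposition", "amount": 3000.0},
]


def average_cost_by_case_type(items, task):
    """Average the billed amount for one task, grouped by case type."""
    totals = defaultdict(lambda: [0.0, 0])  # case type -> [sum, count]
    for item in items:
        if item["task"] == task:
            entry = totals[item["case_type"]]
            entry[0] += item["amount"]
            entry[1] += 1
    return {case_type: total / count for case_type, (total, count) in totals.items()}


report = average_cost_by_case_type(LINE_ITEMS, "motion for summary judgment")
```

With the sample data, the employment matters average 10,000 and the single contract matter comes out at 15,000. Grouping by firm instead of case type would give the lawyer-to-lawyer comparison Jacobs mentions.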
The biggest benefit to Blue Shield is the ability to look at macro data
across many different firms, time periods and cases to identify trends. Before
engaging LexTech, Blue Shield of California had to do this type of task
manually, if it did it at all. Because a legal invoice can be 10, 20 or even 100
pages long, ''the only way to track this kind of data would be to go through
hundreds of pages of legal invoices, put information into an Excel spreadsheet
and manually grind it out,'' Jacobs said. ''It was not really feasible to do
this manually.''
Strategic Legal Management has let Blue Shield of California ''look at data
in ways we've never been able to see it before so we can better manage our
outside counsel expense,'' Jacobs said. And there is no manual intervention
required. ''The only time it takes is the time to design the report, query the
system and have the report show up on the computer,'' he said.
A handful of other firms are offering similar technologies. They include
Attensity Corp., Salt Lake City, Biz360 Inc., San Mateo, Calif., and WhizBang
Labs Inc., Provo, Utah. More mainstream content management providers are likely
to follow because there is a definite need in the industry for this type of
technology. IDC's Morris believes it will be another three to five years before
we see widespread adoption, however.
Considerable growth still needs to take place in categorization and text
mining before these applications become more widespread. In the meantime, Morris
said, you can expect to see more specialized applications. The area of content
management is continuing to grow and gain interest, as are verticalization and
industry-specific applications. According to Morris, ''Winning applications will
bring together content, collaboration and analysis in a vertical-specific
process.''
For now, vendors will continue to find ways to make information in and across
documents more useful to users. ''Today, in the age of infoglut, we have lots of
content available within our enterprise repositories,'' said Geoffrey Bock,
senior consultant/analyst at Patricia Seybold Group, Boston. ''But unless we
have the capabilities to find just what we need, just when we need it, all of
these stored information assets represent nothing more than bits on disks.'' So,
a-mining we will go.