In-Depth
Data glut? Call out the search engines
- By Jason J. Meserve
- July 13, 2001
The advent of the Internet and an explosion of electronic data have
made a wealth of information available at the tip of a finger. Information
is stored in multiple data sources spanning wide geographical areas, making
it difficult at best to find pertinent data quickly.
A simple, single- or double-word query to an Internet-based search engine,
such as Digital's AltaVista (www.altavista.com)
or Yahoo! (www.yahoo.com), typically
returns about 10,000 possible results or "hits" based on an
algorithmic scale of relevancy. The good news is that all the possibilities
are in the same HTML format, virtually guaranteeing that they will be
readable through a browser.
A similar issue is emerging in corporations as they face an overabundance of information. Although the scale is smaller,
application development managers attempting to implement search technology face a two-fold problem. First, as with
the Internet, the sheer number of documents that need to be searched and/or indexed by a search engine can be daunting.
Second, and more important, are the file format and physical location of the data. Documents can take the form
of Microsoft Word files, database records, Adobe Acrobat (PDF) files, image files and legacy data -- all located
on multiple servers in physically dispersed areas.
Add to that the limited ability of most end users to take what they are thinking about and construct a query
to find the information exactly as they are processing it in their minds. A telephone sales representative at a
computer hardware distributor trying to find which brands of "disk access controllers" their company
carries would have minimal success in discovering the answer by entering those words alone into most search engines.
Documents containing the individual words "disk," "access" and "controller" would
be returned, but none may have anything to do with the actual device in question.
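The gap between matching individual words and matching the phrase can be shown with a minimal Python sketch. The three sample documents and the function names are invented for illustration; they do not represent any vendor's engine.

```python
import re

# Hypothetical sample documents, invented for this illustration.
DOCS = {
    "doc1": "Our disk access controller line supports SCSI drives.",
    "doc2": "Controller of building access lost a floppy disk.",
    "doc3": "Remote access policy for contractors.",
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def match_any_word(query, docs):
    """Return ids of documents containing at least one query word."""
    words = set(tokenize(query))
    return sorted(d for d, text in docs.items() if words & set(tokenize(text)))

def match_phrase(query, docs):
    """Return ids of documents containing the query words as a contiguous phrase."""
    q = tokenize(query)
    hits = []
    for doc_id, text in docs.items():
        t = tokenize(text)
        if any(t[i:i + len(q)] == q for i in range(len(t) - len(q) + 1)):
            hits.append(doc_id)
    return sorted(hits)

print(match_any_word("disk access controller", DOCS))
print(match_phrase("disk access controller", DOCS))
```

Word-level matching returns all three documents, including the ones about building access and remote access policy, while phrase matching returns only the document that actually describes the device.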
"'Search technology' is a dangerously ambiguous phrase," commented Curt Monash, president and founder
of Elucidate Technologies Inc. and Monash Information Services, both of Lexington, Mass. "These products combine
database indexing and management with a user interface, which has recreated the problem that relational databases
fixed."
With a relational database, explains Monash, users are limited to searching specifically defined columns of
data of a particular type using a structured query, SQL for instance. New search technologies have exacerbated
the problem SQL solved, insisting users learn a different type of query language to search a spectrum of
unstructured data. "[Today's] queries have to be crafted, and it is absurd to think an end user is going to
be able to do it or have a template that accomplishes the goal," said Monash. "Templates need to have
objects that change depending on the type of query."
META SEARCH TOOL CRITERIA
- Query accuracy and performance -- algorithms, voting
- Ancillary functions -- clustering, summarization, semantic search
- Location and index of multiple data types
- Push, agents and repository links
- Centralized indexing augmented with profiles/taxonomies and crawling
- "Extras" -- knowledge mapping, hit highlighting, image search
- Price -- target $100 per seat in negotiations
SOURCE: Meta Group Inc., Stamford, Conn.
Elucidate is attempting to correct the flaws of searching by adding subject sensitivity to search engine
interfaces. When looking for information on a given company, a search engine has to be aware of whether the company
is privately or publicly held, since different information exists depending on the case, Monash said. Ultimately,
Monash believes, the best search interface will be based on natural language queries and will not let a user ask
a question that cannot be answered. Elucidate's technology is still in prototype development.
In the meantime, there is a variety of decent search technology shipping for the corporate Intranet, Web site
or client/server network to help users wade through the glut of information available to them from their desktop.
And the technology is improving daily.
With Boston-based Delphi Group estimating a $534 million market up for grabs in the 1997 search technology marketplace,
Verity Inc. of Sunnyvale, Calif. and Fulcrum Technologies Inc. of Ottawa are among those leading the charge with
a bevy of products for all types of searching needs -- from departmental servers to enterprisewide document retrieval.
As for Wall Street darlings such as Lycos, Framingham, Mass., and InfoSeek, Sunnyvale, Calif., they may have
a strong online presence, but have not made inroads into the corporate landscape, said Michael Sullivan-Trainor,
program director for Internet Research at Framingham, Mass.-based IDC Research.
Prominent player Verity's core technology is its Development Kit, which provides developers and third-party
vendors with a C application programming interface (API) for embedding into their applications. This technology
allows developers both to build the necessary indexes and to search documents in their native format, be it
Word, PDF, ASCII or HTML.
"Our knowledge-based architecture allows one to define 'rules of evidence' that are associated with a given
subject," said Ron Weissman, Verity's vice president of marketing. The knowledge base helps break search results
into hierarchical data sets that are related. These relationships can take the form of a complex taxonomy such
as medical terminology, Weissman explained.
Documentum Inc., Pleasanton, Calif., is embedding Verity technology into its Documentum Enterprise Document
Management System (EDMS), a suite of tools that manages all the procedures between document creation and reuse.
Via its DocPage Server, the EDMS provides services for both documents and Web pages in their native formats, while
managing changes to each document and keeping track of its location within the enterprise network. With such document
handling capabilities, search integration is a must. In 1993, when the first version of EDMS shipped, Documentum
looked at all the available search technologies, said Whitney Martin, a product marketing manager for Documentum.
"Verity wanted to start a partnership, which worked well for us at the time," continued Martin.
Documentum uses Verity's technology to create a more robust search based on content or file information attached
to a given document, such as attributes and meta data. "Search technology is becoming one of the most important
pieces of the architecture," Martin said. "The Web has highlighted that as users are expecting search
in any application they use."
Verity typically sells its technology in bulk OEM deals to vendors -- witness the Documentum deal -- for integration
into custom applications. One marketing problem Verity faces is that many end users use its product
unknowingly. "Verity is in Netscape and Lotus but users don't experience it as Verity, just as part of the
application," said Sullivan-Trainor.
Recently, Verity began offering more out-of-the-box functionality based on its core technology under the Search
'97 family of products. The most prominent tool in the group is IntelliServ for Windows NT -- a solution for delivering
filtered information to users based on preferences. The package, which starts at $7,995, can notify users via E-mail,
pager, ticker or custom start page.
Verity, which has gained considerable mind share in recent years, is feeling pressure. Fulcrum offers SearchServer,
a document indexing and retrieval system; SearchBuilder for Java and C++, for customizing searches within applications;
and a new integrated suite of indexing and retrieval tools called Knowledge Network 2.1, which shipped at the beginning
of last month.
"We are very focused on adhering to open standards," said Dave Haskins, Fulcrum's vice president of
advanced systems research. The company is especially focused on Microsoft standards such as ODBC, and also uses
the JDBC interface for Java, as gateways to its products.
Knowledge Network was launched in part as a reaction to the emerging Internet paradigm and also to help enterprises
manage ever-changing and moving documents. "Knowledge Network points to the existing information source and
automatically indexes and integrates into a single knowledge base," Haskins said.
Fulcrum is targeting Knowledge Network at customers with cross-industry needs, broad distribution organizations,
Lotus Notes or Microsoft Exchange environments, file systems and databases containing large amounts of documents,
and a desire to have single points of entry for accessing applications, according to Haskins.
Tory Tory DesLauriers & Binnington, a large law firm based in Toronto, is now in the pilot phase of a project
using Fulcrum Knowledge Network 2.1, as it attempts to make it easier for its 200 lawyers and 450 administrative
staff members to search a variety of memos and precedents stored in the firm's many databases, most of which
are in Microsoft Word format. Information and documents used in previous cases can be valuable assets for
lawyers during litigation.
According to John Cameron, a partner at Tory Tory DesLauriers & Binnington, Knowledge Network's ability to
create "virtual files" from disparate data sources made the product an attractive offering. Virtual files
allow the law firm to set up groups of databases as a single file, such as "memos," on the user's desktop.
A user searching the "memos" file does not know the search is actually spanning multiple data sources
and types. "We're trying to make it easier for less technically savvy lawyers [and staff members] who may
not know everything about the databases," Cameron said.
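The virtual file idea can be sketched in a few lines of Python. This is only an illustration of the concept as Cameron describes it, not Fulcrum's implementation; the source names, document ids and the `search_virtual_file` function are hypothetical.

```python
# Hypothetical data sources: each maps a document id to its text.
WORD_MEMOS = {"memo-17": "Opinion on lease termination clauses"}
NOTES_DB = {"note-4": "Research memo: lease renewal precedent"}
ARCHIVE = {"arch-9": "Closing agenda for share purchase"}

# A "virtual file" is modeled here as a named group of sources.
VIRTUAL_FILES = {"memos": [WORD_MEMOS, NOTES_DB, ARCHIVE]}

def search_virtual_file(name, term):
    """Search every source behind a virtual file and return matching ids."""
    term = term.lower()
    hits = []
    for source in VIRTUAL_FILES[name]:
        for doc_id, text in source.items():
            if term in text.lower():
                hits.append(doc_id)
    return sorted(hits)

print(search_virtual_file("memos", "lease"))
```

The caller names only "memos" and never sees that three separate stores were consulted, which is the convenience the firm is after.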
Cameron compared his firm's Knowledge Network project to that of an Intranet, which the company does not have
at the moment. "This tool is a simple way to obtain the advantage of an Intranet without any programming at
all," Cameron said.
An Intranet, he added, has to be able to take a user through a subject, almost leading them by the hand, link
by link. "That would be a lot of work on both our technical- and legal-side to create and maintain links to
guide our users to the documents they need," Cameron said. "Fulcrum provides a down and dirty way to
[create an Intranet.]"
Currently, the beta version of Knowledge Network 2.1 is operating on a Pentium 60 machine with 64MB of RAM and
serving 30 to 40 users without problem, according to Kevin Wilson, project manager in the firm's software development
and research department. "We could probably add a larger number of users on the current machine," Wilson
said. "The problem arises when indexing large amounts of documents, which takes more processing power."
Ultimately, Wilson and Cameron believe the system will hold in the neighborhood of five gigabytes of data broken up
into multiple virtual files once it goes into full production, slated for completion by year's end.
Coming at the search technology/knowledge management space from an imaging angle is Excalibur Technologies
Corp., Vienna, Va., and its RetrievalWare product series. The company is using a hybrid of semantic networks (concept
searching) and neural networks (fault-tolerant fuzzy searching) to go beyond standard Boolean searches. Much of
the company's core technology comes from work it has done over the years for the U.S. government, especially in
the image and pattern recognition areas.
"Everyone is moving towards more linguistic and semantic-based searching," said Mark Demers, director
of marketing at Excalibur. "There is a greater need for corporations to access data across all assets in the
enterprise and all users to retrieve data more intuitively."
Excalibur has three offerings. First, Excalibur RetrievalWare performs standard text and document searching;
second, Excalibur Visual RetrievalWare accomplishes the same for images and video based on specific patterns; third,
Excalibur Internet Spider can be used for searching multimedia content on the Web. The latter, based on technology
it acquired from InterPix Software Corp., Santa Clara, Calif., is designed to be an add-on to the first two products,
but can be sold as a stand-alone to complement an existing data management system.
In October, Excalibur released version 6.5 of its product suite, featuring the RetrievalWare File Room. This
latest feature leverages adaptive pattern recognition processing (APRP) and allows fuzzy searches that can overcome
optical character recognition and input errors associated with transferring paper-based information into an electronic
form, such as misspelled or misinterpreted words.
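A rough feel for fault-tolerant matching can be had with Python's standard `difflib` module. This sketch uses simple string similarity, not Excalibur's APRP, whose internals the article does not describe; the indexed words and the 0.75 cutoff are assumptions chosen for illustration.

```python
import difflib

# Hypothetical index vocabulary containing OCR misreads ("1" read for "i" or "l").
INDEXED_WORDS = ["lit1gation", "precedent", "contro11er", "memo"]

def fuzzy_lookup(term, vocabulary, cutoff=0.75):
    """Return indexed words whose similarity to the term meets the cutoff."""
    return difflib.get_close_matches(term.lower(), vocabulary, n=5, cutoff=cutoff)

print(fuzzy_lookup("litigation", INDEXED_WORDS))
```

A correctly spelled query for "litigation" still finds the misread "lit1gation" in the index, where an exact-match search would come back empty.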
Analysts and vendors alike agree that the search market is becoming "commoditized" and moving more
towards the knowledge management space. "[User] expectations are beginning to outstrip the search engine's
ability," said Jeffrey Bock, senior consultant at The Patricia Seybold Group in Boston. "The existing
market is under assault."
Simple search engines can be had at little to no cost. A routine search of C/Net's Shareware.com Web site turned
up five free text search engines for Windows and Macintosh-based Web servers. Commercial developers must provide
more functionality and features to keep the revenues flowing.
Fortunately, some of the greatest advances in search and retrieval technology have yet to come. Most advances
will be made in the methods search engines use to interpret user input and to apply concepts within the query to
the knowledge in their indices. Delphi's executive vice president Carl Frappaolo sees the common method by which
search engines rank results being replaced by a methodology of refining search results on the fly. One existing
way to do this is by feeding examples found in the original search into a new search: "Find documents
that resemble this," Frappaolo said.
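One common way to approximate "find documents that resemble this" is to compare word-count vectors with cosine similarity. The sketch below is a generic illustration of that technique, not a description of any product named here; the function names and sample documents are invented.

```python
import math
import re
from collections import Counter

def vectorize(text):
    """Turn text into a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def more_like_this(example, docs):
    """Rank document ids by similarity to the example text, best first."""
    ev = vectorize(example)
    return sorted(docs, key=lambda d: cosine(ev, vectorize(docs[d])), reverse=True)

# Hypothetical documents for illustration.
DOCS = {
    "a": "disk controller firmware update",
    "b": "quarterly sales report",
    "c": "disk controller installation guide",
}
print(more_like_this("disk controller setup guide", DOCS))
```

The user supplies a whole document, or a passage from one, instead of crafting a query, and the engine ranks the rest of the collection by how closely each item resembles it.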
One company making significant strides in developing new ways of presenting search results to users is InXight
Software Inc., a Palo Alto, Calif.-based Xerox spin-off. InXight's LinguistX software analyzes incoming text, normalizes
words to their standard canonical form (making plurals singular) and identifies concepts within the search, such
as "disk access controller."
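The two steps, normalizing words and spotting multiword concepts, can be sketched as follows. The plural rule is deliberately naive and the concept lexicon is hypothetical; a linguistic product such as LinguistX would rely on far richer morphology than this.

```python
import re

# Hypothetical concept lexicon: multiword terms stored as tuples of normalized words.
CONCEPTS = {("disk", "access", "controller")}

def singularize(word):
    """Very naive plural stripping, for illustration only."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

def normalize(text):
    """Lowercase, tokenize and reduce each word to a singular form."""
    return [singularize(w) for w in re.findall(r"[a-z]+", text.lower())]

def find_concepts(text):
    """Return known multiword concepts that appear in the normalized text."""
    tokens = normalize(text)
    found = []
    for concept in sorted(CONCEPTS):
        n = len(concept)
        if any(tuple(tokens[i:i + n]) == concept for i in range(len(tokens) - n + 1)):
            found.append(" ".join(concept))
    return found

print(find_concepts("Which disk access controllers do we carry?"))
```

Because the plural "controllers" is reduced before matching, the query is recognized as a single concept and not as three unrelated words.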
At a higher level, InXight's tools can create document summaries based on patterns and relevant pieces of the
search query, giving users greater insight into the document's actual content, said Ian Hersey, advanced product
planning manager for InXight.
Currently, the company is only licensing the product to third-party vendors including Verity and SPSS Inc. of
Chicago. A producer of statistical analysis software for market and scientific research, business analysis and
quality process control, SPSS licensed LinguistX for its TextSmart product. TextSmart categorizes free-form (open-ended)
survey answers automatically, instead of a person having to code each response by hand, said Louise Rehling, executive
vice president of product development at SPSS.
TextSmart, with the help of LinguistX, checks the answer for spelling, reduces words to their stem, performs
cluster analysis and creates a coding category which can be later reviewed for accuracy. "The LinguistX product
tries to get all the 'noise' out of the answer by reducing the total number of words down to key words," Rehling
explained.
Another InXight/Xerox innovation is the hyperbolic tree concept, which returns search results in a visual format rather
than text. The visual format allows the user to see how each of the search results is related to the rest of the
surrounding documents. Color coding enables users to see the density of results in a given area. "We're aiming
to solve everything in one view with the hyperbolic tree," InXight's Hersey said.
Paris-based ERLI is also offering natural language processing in its IR product. IR is a server-based tool that
is said to understand the user's frame of reference -- what the phrase and combination of words mean to them. This
enables the tool to build an optimized query that can be used by the search engine for more accurate retrieval
results. ERLI's TM administrative tool allows users to add words and meanings to the standard lexicon used by the
linguistic server, thus training the server to understand a particular company's or user's language better.
InXight and others are working to find a means for search engines to recognize questions, such as "Who
was the last major league player to hit .400 in a season?" Hersey envisions an engine able to determine the
result should be related to a person who played baseball, at the very least. Enter that search into a modern day
search engine and see what happens.
Until Hersey's vision is realized, look for improvements in relevancy rankings and search refinement. Assorted
vendors are working on advances in neural and fuzzy logic searching. For example, Fulcrum is working on neural
network-based technology for automatic subject classification of documents. This technology should be
available in the first half of 1998, predicts Fulcrum's Haskins. Verity's Weissman sees more query navigation and
categorization techniques in his company's future, all of which should help less technical end users find the impossible.