Data glut? Call out the search engines

The advent of the Internet and an explosion of electronic data have made a wealth of information available at the tip of a finger. Information is stored in multiple data sources spanning wide geographical areas, making it difficult at best to find pertinent data quickly.

A simple, single- or double-word query to an Internet-based search engine, such as Digital's AltaVista or Yahoo!, typically returns about 10,000 possible results, or "hits," based on an algorithmic scale of relevancy. The good news is that all the possibilities are in the same HTML format, virtually guaranteeing that they will be readable through a browser.
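That "algorithmic scale of relevancy" is typically some variant of term-frequency weighting. A minimal sketch, using classic TF-IDF over a toy corpus of pre-tokenized documents (TF-IDF stands in here for whatever proprietary scheme a given engine actually uses):

```python
import math

def tf_idf_scores(query_terms, documents):
    """Score each document by the summed TF-IDF weight of the query terms.

    documents: list of token lists. Returns (doc_index, score) pairs,
    highest score first.
    """
    n_docs = len(documents)
    scores = []
    for i, doc in enumerate(documents):
        score = 0.0
        for term in query_terms:
            tf = doc.count(term) / len(doc)              # term frequency
            df = sum(1 for d in documents if term in d)  # document frequency
            if df:
                idf = math.log(n_docs / df)              # rarer terms weigh more
                score += tf * idf
        scores.append((i, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

A document dense in rare query terms rises to the top; documents containing none of the terms score zero and sink to the bottom of the list.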

The same issue, on a smaller scale, is emerging in corporations as they face an overabundance of information. Application development managers attempting to implement search technology face a two-fold problem. First, as with the Internet, the sheer number of documents that need to be searched and/or indexed by a search engine can be daunting. Second, and more important, are the file formats and physical locations of the data. Documents can take the form of Microsoft Word files, database records, Adobe Acrobat (PDF) files, image files and legacy data -- all located on multiple servers in physically dispersed areas.

Add to that the limited ability of most end users to take what they are thinking about and construct a query to find the information exactly as they are processing it in their minds. A telephone sales representative at a computer hardware distributor trying to find which brands of "disk access controllers" their company carries would have minimal success in discovering the answer by entering those words alone into most search engines. Documents containing the individual words "disk," "access" and "controller" would be returned, but none may have anything to do with the actual device in question.
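The gap between what the sales representative means and what the engine matches can be shown in a few lines. The document below is invented for illustration:

```python
def word_match(query, text):
    """Naive engine behavior: any document containing any query word is a hit."""
    words = query.lower().split()
    return any(w in text.lower().split() for w in words)

def phrase_match(query, text):
    """Require the words to appear together as a phrase."""
    return query.lower() in text.lower()

doc = "Access to the controller room requires a key disk."
print(word_match("disk access controller", doc))    # True: words appear, irrelevantly
print(phrase_match("disk access controller", doc))  # False: not the device in question
```

The naive matcher returns the document because "access" and "controller" each occur somewhere in it, even though it has nothing to do with disk access controllers.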

"'Search technology' is a dangerously ambiguous phrase," commented Curt Monash, president and founder of Elucidate Technologies Inc. and Monash Information Services, both of Lexington, Mass. "These products combine database indexing and management with a user interface, which has recreated the problem that relational databases fixed."

Criteria for evaluating search technology:

  1. Query accuracy and performance -- algorithms, voting
  2. Ancillary functions -- clustering, summarization, semantic search
  3. Location and index of multiple data types
  4. Push, agents and repository links
  5. Centralized indexing augmented with profiles/taxonomies and crawling
  6. "Extras" -- knowledge mapping, hit highlighting, image search
  7. Price -- target $100 per seat in negotiations

SOURCE: Meta Group Inc., Stamford, Conn.

With a relational database, explains Monash, users are limited to searching specifically defined columns of data of a particular type using a structured query language such as SQL. New search technologies have exacerbated the problem SQL solved, insisting users learn a different type of query language to search a spectrum of unstructured data. "[Today's] queries have to be crafted, and it is absurd to think an end user is going to be able to do it or have a template that accomplishes the goal," said Monash. "Templates need to have objects that change depending on the type of query."
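The contrast Monash draws between structured and unstructured querying can be made concrete. A sketch using SQLite for the structured side and plain keyword matching for the unstructured side (the table and documents are invented for illustration):

```python
import sqlite3

# A relational query: the user must know the schema and SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE companies (name TEXT, state TEXT, revenue REAL)")
conn.execute("INSERT INTO companies VALUES ('Acme', 'MA', 12.5)")
conn.execute("INSERT INTO companies VALUES ('Beta', 'CA', 3.1)")
rows = conn.execute(
    "SELECT name FROM companies WHERE state = 'MA' AND revenue > 10"
).fetchall()
print(rows)  # structured: typed columns, exact semantics

# The unstructured equivalent: free text, no schema to constrain the question.
documents = [
    "Acme, a Massachusetts company, reported revenue of 12.5 million.",
    "Beta of California posted 3.1 million in revenue.",
]
hits = [d for d in documents if "massachusetts" in d.lower() and "revenue" in d.lower()]
print(hits)  # keyword matching: no guarantee the numbers mean what the user thinks
```

The SQL query has one unambiguous meaning; the keyword query merely tests for word presence, which is why crafting it well matters so much.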

Elucidate is attempting to correct the flaws of searching by adding subject sensitivity to search engine interfaces. When looking for information on a given company, for example, a search engine has to be aware of whether the company is privately or publicly held, since different information exists in each case, Monash illustrated. Ultimately, Monash believes, the best search interface will be based on natural language queries and will not let a question be asked that cannot be answered. Elucidate's technology is still in prototype development.

In the meantime, there is a variety of decent search technology shipping for the corporate Intranet, Web site or client/server network to help users wade through the glut of information available to them from their desktop. And the technology is improving daily.

With Boston-based Delphi Group estimating a $534 million search technology market up for grabs in 1997, Verity Inc. of Sunnyvale, Calif., and Fulcrum Technologies Inc. of Ottawa are among those leading the charge with a bevy of products for all types of searching needs -- from departmental servers to enterprisewide document retrieval.

As for Wall Street darlings such as Lycos, Framingham, Mass., and InfoSeek, Sunnyvale, Calif., they may have a strong online presence, but have not made inroads into the corporate landscape, said Michael Sullivan-Trainor, program director for Internet Research at Framingham, Mass.-based IDC Research.

Prominent player Verity's core technology is its Development Kit, which provides developers and third-party vendors with a C application programming interface (API) for embedding into their applications. This technology allows developers both to build the necessary indexes and to search documents in their native format, be it Word, PDF, ASCII or HTML.
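Verity's actual API is not reproduced here; the following is a generic sketch of the index-then-search pattern such a development kit exposes, with native-format filtering omitted:

```python
from collections import defaultdict

class MiniIndex:
    """Toy analogue of an indexing/search API (not Verity's actual interface)."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of doc ids
        self.docs = {}

    def add_document(self, doc_id, text):
        """Tokenize the document and record which terms it contains."""
        self.docs[doc_id] = text
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, query):
        """Return ids of documents containing every query term."""
        terms = query.lower().split()
        if not terms:
            return set()
        result = self.postings[terms[0]].copy()
        for term in terms[1:]:
            result &= self.postings[term]
        return result

idx = MiniIndex()
idx.add_document("a", "word file about disk controllers")
idx.add_document("b", "html page about laser printers")
print(idx.search("disk controllers"))  # {'a'}
```

In a real kit, the tokenizer would be preceded by per-format filters that extract text from Word, PDF or HTML before indexing.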

"Our knowledge-based architecture allows one to define 'rules of evidence' that are associated with a given subject," said Ron Weissman, Verity's vice president of marketing. The knowledge base helps break search results into hierarchical data sets that are related. These relationships can take the form of a complex taxonomy such as medical terminology, Weissman explained.
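Weissman's "rules of evidence" can be pictured as weighted term lists attached to nodes of a topic hierarchy; a document accumulates evidence for a topic from its own terms and those of its subtopics. The taxonomy, terms and weights below are invented for illustration:

```python
# Hypothetical "rules of evidence": each topic lists weighted terms whose
# presence counts as evidence for the topic, and topics nest hierarchically.
taxonomy = {
    "medicine": {
        "evidence": {"patient": 1.0, "clinical": 1.5},
        "children": {
            "cardiology": {
                "evidence": {"heart": 2.0, "arrhythmia": 3.0},
                "children": {},
            },
        },
    },
}

def topic_score(node, tokens):
    """Sum the evidence weights found in the document, including subtopics."""
    score = sum(w for term, w in node["evidence"].items() if term in tokens)
    for child in node["children"].values():
        score += topic_score(child, tokens)
    return score

doc = set("clinical study of heart arrhythmia in one patient".split())
print(topic_score(taxonomy["medicine"], doc))  # 7.5
```

A document strong on cardiology terms thereby also counts as evidence for medicine, which is how hierarchical result sets like the medical-terminology taxonomy fall out.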

Documentum Inc., Pleasanton, Calif., is embedding Verity technology into its Documentum Enterprise Document Management System (EDMS), a suite of tools that manages all the procedures between document creation and reuse. Via its DocPage Server, the EDMS provides services for both documents and Web pages in their native formats, while managing changes to each document and keeping track of its location within the enterprise network. With such document handling capabilities, search integration is a must. In 1993, when the first version of EDMS shipped, Documentum looked at all the available search technologies, said Whitney Martin, a product marketing manager for Documentum. "Verity wanted to start a partnership, which worked well for us at the time," continued Martin.

Documentum uses Verity's technology to create a more robust search based on content or file information attached to a given document, such as attributes and meta data. "Search technology is becoming one of the most important pieces of the architecture," Martin said. "The Web has highlighted that as users are expecting search in any application they use."

Verity typically sells its technology in bulk OEM deals to vendors -- witness the Documentum deal -- for integration into custom applications. One problem Verity faces in terms of marketing is that many end users use their product unknowingly. "Verity is in Netscape and Lotus but users don't experience it as Verity, just as part of the application," said Sullivan-Trainor.

Recently, Verity began offering more out-of-the-box functionality based on its core technology under the Search '97 family of products. The most prominent tool in the group is IntelliServ for Windows NT -- a solution for delivering filtered information to users based on preferences. The package, which starts at $7,995, can notify users via E-mail, pager, ticker or custom start page.

Verity, which had gained considerable mind share in recent years, is feeling pressure. Fulcrum offers SearchServer, a document indexing and retrieval system; SearchBuilder for Java and C++, for customizing searches within applications; and Knowledge Network 2.1, a new integrated suite of indexing and retrieval tools that shipped at the beginning of last month.

"We are very focused on adhering to open standards," said Dave Haskins, Fulcrum's vice president of advanced systems research. The company is especially focused on Microsoft standards, using the ODBC and JDBC interfaces as gateways to its products.

Knowledge Network was launched in part as a reaction to the emerging Internet paradigm and also to help enterprises manage ever-changing and moving documents. "Knowledge Network points to the existing information source and automatically indexes and integrates into a single knowledge base," Haskins said.

Fulcrum is targeting Knowledge Network at customers with cross-industry needs, broad distribution organizations, Lotus Notes or Microsoft Exchange environments, file systems and databases containing large amounts of documents, and a desire to have single points of entry for accessing applications, according to Haskins.

Tory Tory DesLauriers & Binnington, a large law firm based in Toronto, is now in the pilot phase of a project using Fulcrum Knowledge Network 2.1 as it attempts to make it easier for its 200 lawyers and 450 administrative staff members to search the memos and precedents stored in the company's many databases; most of those documents are in Microsoft Word format. Information and documents used in previous cases can be valuable assets for lawyers during litigation.

According to John Cameron, a partner at Tory Tory DesLauriers & Binnington, Knowledge Network's ability to create "virtual files" from disparate data sources made the product an attractive offering. Virtual files allow the law firm to present groups of databases as a single file, such as "memos," on the user's desktop. A user searching the "memos" file does not know the search is actually spanning multiple data sources and types. "We're trying to make it easier for less technically savvy lawyers [and staff members] who may not know everything about the databases," Cameron said.
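The "virtual file" idea reduces to a named fan-out across underlying sources. A sketch, in which the two stores and the "memos" grouping are invented stand-ins for the firm's databases:

```python
# Two underlying document stores the user never sees directly.
word_memos = {"m1": "memo on precedent for contract disputes"}
notes_db = {"n1": "memo about billing procedures"}

def search_source(store, term):
    """Search one store; returns matching document ids."""
    return [doc_id for doc_id, text in store.items() if term in text]

# The "virtual file": one name mapped to several sources by an administrator.
virtual_files = {
    "memos": [word_memos, notes_db],
}

def search_virtual_file(name, term):
    """Fan the search out across every source behind the named virtual file."""
    hits = []
    for store in virtual_files[name]:
        hits.extend(search_source(store, term))
    return hits

print(search_virtual_file("memos", "memo"))  # hits from both sources
```

From the desktop, the user issues one query against "memos"; the fan-out and merge happen behind the name.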

Cameron compared his firm's Knowledge Network project to an Intranet, which the company does not have at the moment. "This tool is a simple way to obtain the advantage of an Intranet without any programming at all," Cameron said.

An Intranet, he added, has to be able to take a user through a subject, almost leading them by the hand, link by link. "That would be a lot of work on both our technical and legal sides to create and maintain links to guide our users to the documents they need," Cameron said. "Fulcrum provides a down and dirty way to [create an Intranet]."

Currently, the beta version of Knowledge Network 2.1 is operating on a Pentium 60 machine with 64MB of RAM and serving 30 to 40 users without problem, according to Kevin Wilson, project manager in the firm's software development and research department. "We could probably add a larger number of users on the current machine," Wilson said. "The problem arises when indexing large amounts of documents, which takes more processing power."

Ultimately, Wilson and Cameron believe the system will hold in the neighborhood of five gigabytes of data, broken up into multiple virtual files, once it goes into full production, slated for completion by year's end.

Coming at the search technology/knowledge management space from an imaging angle is Excalibur Technologies Corp., Vienna, Va., with its RetrievalWare product series. The company uses a hybrid of semantic networks (concept searching) and neural networks (fault-tolerant fuzzy searching) to go beyond standard Boolean searches. Much of the company's core technology comes from work it has done over the years for the U.S. government, especially in the image and pattern recognition areas.

"Everyone is moving towards more linguistic and semantic-based searching," said Mark Demers, director of marketing at Excalibur. "There is a greater need for corporations to access data across all assets in the enterprise and all users to retrieve data more intuitively."

Excalibur has three offerings: Excalibur RetrievalWare performs standard text and document searching; Excalibur Visual RetrievalWare does the same for images and video based on specific patterns; and Excalibur Internet Spider searches multimedia content on the Web. The latter, based on technology acquired from InterPix Software Corp., Santa Clara, Calif., is designed as an add-on to the first two products, but can be sold as a stand-alone to complement an existing data management system.

In October, Excalibur released version 6.5 of its product suite, featuring the RetrievalWare File Room. This latest feature leverages adaptive pattern recognition processing (APRP) and allows fuzzy searches that can overcome optical character recognition and input errors associated with transferring paper-based information into an electronic form, such as misspelled or misinterpreted words.
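Fuzzy matching of this kind is commonly built on edit distance, which tolerates the substituted or misread characters OCR produces; APRP itself is proprietary and not shown. A minimal sketch:

```python
def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def fuzzy_hits(query, terms, max_dist=2):
    """Match indexed terms despite OCR slips such as 'o' read as '0'."""
    return [t for t in terms if edit_distance(query, t) <= max_dist]

ocr_terms = ["invoice", "inv0ice", "irivoice", "receipt"]
print(fuzzy_hits("invoice", ocr_terms))
```

A query for "invoice" still finds the garbled variants within two edits while rejecting genuinely different words.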

Analysts and vendors alike agree that the search market is becoming "commoditized" and moving more towards the knowledge management space. "[User] expectations are beginning to outstrip the search engine's ability," said Jeffrey Bock, senior consultant at The Patricia Seybold Group in Boston. "The existing market is under assault."

Simple search engines can be had at little to no cost. A routine search of C/Net's Web site turned up five free text search engines for Windows and Macintosh-based Web servers. Commercial developers must provide more functionality and features to keep the revenues flowing.

Fortunately, some of the greatest advances in search and retrieval technology are yet to come. Most will be made in the methods search engines use to interpret user input and to apply concepts within the query to the knowledge in their indices. Delphi's executive vice president Carl Frappaolo sees the common method of ranking results being replaced by a methodology of refining search results on the fly. One existing way to do this is to feed examples of what is being sought from the original results back into a new search: "Find documents that resemble this," Frappaolo said.
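Frappaolo's "find documents that resemble this" is, in its simplest form, a bag-of-words similarity ranking against an example document. The corpus below is invented for illustration:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    common = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in common)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def more_like_this(example, corpus):
    """Rank corpus documents by similarity to an example document."""
    ev = Counter(example.lower().split())
    return sorted(corpus,
                  key=lambda d: cosine(ev, Counter(d.lower().split())),
                  reverse=True)

corpus = ["disk controller firmware update",
          "quarterly sales report",
          "disk controller installation guide"]
print(more_like_this("new disk controller drivers", corpus))  # least similar last
```

Instead of crafting a new query, the user points at a good result and the engine re-ranks around it.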

One company making significant strides in developing new ways of presenting search results to users is InXight Software Inc., a Palo Alto, Calif.-based Xerox spin-off. InXight's LinguistX software analyzes incoming text, normalizes words to their canonical form (making plurals singular, for example) and identifies concepts within the search, such as "disk access controller."
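LinguistX's internals are not public here; a toy version of the normalize-then-spot-concepts step, with one hard-coded concept and a deliberately crude plural rule, might look like:

```python
KNOWN_CONCEPTS = {("disk", "access", "controller")}  # hypothetical concept list

def normalize(token):
    """Crude canonicalization: lowercase and strip a plural 's' (sketch only)."""
    token = token.lower()
    if token.endswith("s") and not token.endswith("ss"):
        token = token[:-1]
    return token

def analyze(text):
    """Normalize tokens, then scan for known multiword concepts."""
    tokens = [normalize(t) for t in text.split()]
    concepts = []
    for i in range(len(tokens)):
        for concept in KNOWN_CONCEPTS:
            if tuple(tokens[i:i + len(concept)]) == concept:
                concepts.append(" ".join(concept))
    return tokens, concepts

tokens, concepts = analyze("Disk access controllers")
print(tokens)    # ['disk', 'access', 'controller']
print(concepts)  # ['disk access controller']
```

Once "controllers" is reduced to "controller," the three tokens line up with the stored concept and the phrase can be searched as a unit rather than as three unrelated words.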

At a higher level, InXight's tools can create document summaries based on patterns and relevant pieces of the search query, giving users greater insight into the document's actual content, said Ian Hersey, advanced product planning manager for InXight.

Currently, the company is only licensing the product to third-party vendors including Verity and SPSS Inc. of Chicago. A producer of statistical analysis software for market and scientific research, business analysis and quality process control, SPSS licensed LinguistX for its TextSmart product. TextSmart categorizes free-form (open-ended) survey answers automatically, instead of a person having to code each response by hand, said Louise Rehling, executive vice president of product development at SPSS.

TextSmart, with the help of LinguistX, checks the answer for spelling, reduces words to their stem, performs cluster analysis and creates a coding category which can be later reviewed for accuracy. "The LinguistX product tries to get all the 'noise' out of the answer by reducing the total number of words down to key words," Rehling explained.

Another InXight/Xerox innovation is the hyperbolic tree, which presents search results in a visual format rather than as text. The visual format allows the user to see how each of the search results is related to the rest of the surrounding documents. Color coding enables users to see the density of results in a given area. "We're aiming to solve everything in one view with the hyperbolic tree," InXight's Hersey said.

Paris-based ERLI is also offering natural language processing in its IR product. IR is a server-based tool that is said to understand the user's frame of reference -- what the phrase and combination of words mean to them. This enables the tool to build an optimized query that can be used by the search engine for more accurate retrieval results. ERLI's TM administrative tool allows users to add words and meanings to the standard lexicon used by the linguistic server, thus training the server to understand a particular company's or user's language better.

InXight and others are working to find a means for search engines to recognize questions, such as "Who was the last major league player to hit .400 in a season?" Hersey envisions an engine able to determine the result should be related to a person that played baseball, at the very least. Enter that search into a modern day search engine and see what happens.
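The first step of the inference Hersey describes, deducing the expected answer type from the question's lead word, can be sketched with simple rules; the cue table is illustrative only:

```python
# Rule-based sketch: map a question's lead word to the expected answer type,
# the minimal inference that a "who" question wants a person.
ANSWER_TYPES = {
    "who": "person",
    "when": "date",
    "where": "place",
    "how many": "number",
}

def answer_type(question):
    """Return the expected answer type, or 'unknown' if no cue matches."""
    q = question.lower().strip()
    for cue in sorted(ANSWER_TYPES, key=len, reverse=True):  # longest cue first
        if q.startswith(cue):
            return ANSWER_TYPES[cue]
    return "unknown"

print(answer_type("Who was the last major league player to hit .400 in a season?"))
```

Even this crude typing lets an engine discard results that cannot possibly name a person, which keyword matching alone never does.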

Until Hersey's vision is realized, look for improvements in relevancy rankings and search refinement. Assorted vendors are working on advances in neural and fuzzy logic searching. For example, Fulcrum is working on neural network-based technology for accomplishing automatic subject classifying of documents. This technology should be available in the first half of 1998, predicts Fulcrum's Haskins. Verity's Weissman sees more query navigation and categorization techniques in his company's future, all of which should help less technical end users find the impossible.