CodeCrawler in search of developers

A group of developers at the University of Illinois at Urbana-Champaign has released CodeCrawler, a Web-based search engine tailored for developers who need to search source code.

The tool, which is available at, is a token-identifying, code-discovering engine that ranks results by relevance for developers trying to find components of source code across local file systems and Web-based code banks. Its core comprises several components, including Lucene, Ctags and Highlight, and is delivered via an interface.

Administrators install CodeCrawler and configure source code repositories to be searched. CodeCrawler then builds a search index for the source code while analyzing each file and extracting semantic information. Developers search the indexed repositories and examine the code from within a Web browser.
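The indexing step described above can be sketched in a few lines of Python. This is only an illustration of the general idea of an inverted index, not CodeCrawler's actual implementation, which the article says builds on Lucene. The tokenizer, the function names `build_index` and `lookup`, and the sample repository are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Hypothetical tokenizer: treats identifiers as tokens.
# This is an assumption for illustration, not CodeCrawler's analyzer.
TOKEN_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_index(files):
    """Map each lowercased token to the set of (path, line_number) where it occurs.

    `files` is a dict of path -> source text, standing in for a configured repository.
    """
    index = defaultdict(set)
    for path, text in files.items():
        for line_no, line in enumerate(text.splitlines(), start=1):
            for token in TOKEN_RE.findall(line):
                index[token.lower()].add((path, line_no))
    return index


def lookup(index, token):
    """Return sorted locations for a token, or an empty list if it was never indexed."""
    return sorted(index.get(token.lower(), ()))


# Made-up sample repository.
repo = {
    "Stack.java": "class Stack {\n  void push(int value) {}\n}",
    "Main.java": "Stack s = new Stack();\ns.push(1);",
}
index = build_index(repo)
print(lookup(index, "push"))  # [('Main.java', 2), ('Stack.java', 2)]
```

Building the index once up front is what lets later searches avoid rescanning every file, which is the main difference from running grep over the repository on each query.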

Developers often are burdened with large-scale software development and maintenance. As the code base grows, it becomes more difficult to keep the code and documentation up to date and to fix existing bugs.

They use grep utilities to find a particular piece of code by searching source files for a match with a regular expression, but grep utilities have disadvantages, according to the university's developers. Writing a regular expression for a search requires at least some knowledge about what is being searched. The results returned from a grep are all the matches to the given regular expression, but not all of them are relevant. Grep utilities also are part of the operating system or integrated within an IDE, so the results can't be viewed from the Web.

Although Web search engines aren't as precise as grep utilities, they allow inexact matches, rank the results by relevance and display them in a Web-viewable form. They also compute a relevance score for each result based on how many times the search keywords occur in it.
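As a rough illustration of occurrence-based relevance, the sketch below counts keyword hits per document and sorts by that count. Real search engines use more elaborate scoring; the function names `keyword_score` and `rank` and the sample documents are hypothetical.

```python
import re


def keyword_score(text, keywords):
    """Count case-insensitive whole-word occurrences of any keyword in text."""
    words = re.findall(r"\w+", text.lower())
    wanted = {k.lower() for k in keywords}
    return sum(1 for w in words if w in wanted)


def rank(documents, keywords):
    """Return (name, score) pairs with score > 0, highest score first."""
    scored = [(name, keyword_score(text, keywords)) for name, text in documents.items()]
    hits = [pair for pair in scored if pair[1] > 0]
    # Ties are broken alphabetically by name so the order is deterministic.
    return sorted(hits, key=lambda pair: (-pair[1], pair[0]))


# Made-up documents for the example.
docs = {
    "a.txt": "parse the file, then parse the header",
    "b.txt": "open the file",
    "c.txt": "nothing relevant here",
}
results = rank(docs, ["parse", "file"])
print(results)  # [('a.txt', 3), ('b.txt', 1)]
```

Unlike a grep, which would list every matching line in no particular order, this puts the document with the most keyword hits first and drops documents with none.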

CodeCrawler combines features of Web search engines and grep utilities, adding knowledge about programming language syntax and source code semantics so that searches can determine the relevance of results more accurately. It provides a Web interface that lets users submit queries using the regular expressions found in grep searches, the keywords used in Web searches and special programming-specific extensions.

Search results will be ranked by relevance, taking into account source code semantics, such as classes, methods and variables, and will point to the original source code. CodeCrawler will support many programming languages and will be extended with support for new ones.
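One way to picture semantics-aware ranking is to weight a match by the kind of symbol it hits. The sketch below is a guess at the general approach, not CodeCrawler's published algorithm. The weights, the `(kind, name)` symbol format and the function names are all assumptions.

```python
# Hypothetical weights: a match on a class name counts more than one on a
# method name, which counts more than one on a variable name.
KIND_WEIGHTS = {"class": 3.0, "method": 2.0, "variable": 1.0}


def semantic_score(symbols, query):
    """Sum the weights of symbols whose name contains the query (case-insensitive).

    `symbols` is a list of (kind, name) pairs, such as a tag extractor might produce.
    """
    q = query.lower()
    return sum(KIND_WEIGHTS.get(kind, 0.0) for kind, name in symbols if q in name.lower())


def rank_files(tagged_files, query):
    """Return (path, score) pairs with score > 0, highest score first."""
    scored = [(path, semantic_score(symbols, query)) for path, symbols in tagged_files.items()]
    return sorted((p for p in scored if p[1] > 0), key=lambda p: (-p[1], p[0]))


# Made-up symbol tables for three files.
tagged = {
    "Parser.java": [("class", "Parser"), ("method", "parseHeader"), ("variable", "buffer")],
    "Main.java": [("method", "main"), ("variable", "parser")],
    "Util.java": [("method", "trim")],
}
ranked = rank_files(tagged, "parse")
print(ranked)  # [('Parser.java', 5.0), ('Main.java', 1.0)]
```

Here the file that defines a matching class and method outranks the file that merely holds a matching variable, even though both contain the query text.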

About the Author

Kathleen Ohlson is senior editor at Application Development Trends magazine.