In-Depth

USAToday.com puts XML to work at the editor's desk

The newspaper business has always been hectic. In the 1920s, a reporter would rush out of a courtroom, grab a phone (the new technology of that era), call the newsroom and scream: ''Get me rewrite!''

The reporter would then dictate the story to a rewrite man, who banged it out on a manual typewriter. An editor armed with a lead pencil would mark up the typewritten pages and add instructions telling the typesetter where the paragraphs broke. He then added a headline and handed the story to a copyboy, who rushed it to a linotype operator, who set it for the afternoon paper.

The daily news business is no less hectic in 2002. While editors at USA Today still mark up stories, they now use the Extensible Markup Language (XML), the electronic descendant of the old pen and pencil marks that date back to editors in green eyeshades.

Editors for the USAToday.com Web site are also beginning to work with software tools that scan stories and automatically compose headlines, write summaries of the content and list 10 keywords for search engines.

The XML editorial system at USAToday.com, which rolled out this summer, began with a commitment to implementing XML technology for editing and publishing news stories, according to Chat Joglekar, information management architect for the project. The development project, aimed at producing faster, more intelligent news coverage, began a year-and-a-half ago, he added.

''Before that we didn't store content in XML,'' Joglekar explained. ''We basically stored it in HTML, which made it hard to understand what a story encompassed. In HTML, you were always guessing what the headline was, what the byline was. In XML, we can tag it appropriately and always know exactly what the headline is, who the author is, the dateline ... all those appropriate things.''
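
To make the contrast concrete, here is a minimal Python sketch of why tagged XML is easier to process than presentation-oriented HTML. The element names (`story`, `headline`, `byline`, `dateline`, `body`) and the sample byline and dateline are hypothetical; they are not taken from USAToday.com's actual format. The point is only that each field can be looked up by name rather than guessed from layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical story markup; the element names are illustrative assumptions,
# not USAToday.com's real document definition.
STORY_XML = """
<story>
  <headline>49ers Defeat Broncos</headline>
  <byline>Staff Writer</byline>
  <dateline>SAN FRANCISCO</dateline>
  <body>
    <p>First paragraph of the story.</p>
    <p>Second paragraph of the story.</p>
  </body>
</story>
"""


def extract_fields(xml_text):
    """Return the named fields of a story as a plain dictionary."""
    root = ET.fromstring(xml_text)
    return {
        "headline": root.findtext("headline"),
        "byline": root.findtext("byline"),
        "dateline": root.findtext("dateline"),
        "paragraphs": [p.text for p in root.findall("body/p")],
    }


fields = extract_fields(STORY_XML)
print(fields["headline"])
```

With HTML, the same lookup would typically mean inferring the headline from font tags or position on the page, which is the guessing Joglekar describes.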

Because the newspaper business and its online publishing outlets are a relatively small vertical industry, Joglekar found very little off-the-shelf software that he could deploy in the new XML-based system.

The applications the editors use to select news items, assign meta data and publish the final stories on the Web site are all ''homegrown,'' he said. ''It's pretty specific [to news editing], so we could not really leverage any other tools that were out there.''

USAToday.com is a Windows 2000 shop running on Intel Pentium-based Dell hardware. The developers used Microsoft BizTalk Server and SQL databases to capture and store the news stories that are fed in electronically by wire services. Most of the coding was done using Visual Basic, Joglekar explained. Development began with the COM+ model, but as the Microsoft .NET platform and tools came on the market early in the project phase, the programming team began using ASP.NET and more recently C#, he said.

''The code was primarily VB 6,'' Joglekar said. ''Anything that we currently rewrite, upgrade or update we're doing in .NET. If we touch a piece of the system again, it's going to be written in C# or ASP.NET.''

Joglekar said he was able to buy two applications -- XMetal Editor from SoftQuad, now part of Corel Corp., Ottawa; and XML Categorizer and Concept Tagger ontology-based tools from Los Angeles-based Applied Semantics Inc. -- and plug them into the system.

The editors at USAToday.com use XMetal to add XML tags to the wire stories that come into the system from Reuters, the Associated Press and other news services, Joglekar said. While there is an XML-based News Markup Language, called NewsML, USAToday.com created its own DTD that is ''loosely based'' on it, but which incorporates document definitions specific to the way editors for the Web site work with stories, he added.

''It's basically our own XML format,'' Joglekar said. ''It incorporates stuff we use for the meta data we want to capture, including the type of handling we do -- such as where it will end up on the site.''
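
As a rough illustration of what capturing handling meta data alongside a story might look like, the Python sketch below attaches a placement and a keyword list to a story element. The `meta`, `placement`, `keywords` and `keyword` element names, the `add_metadata` helper and the `sports/nfl` value are all assumptions made for this example; the article does not describe the actual structure of USAToday.com's DTD.

```python
import xml.etree.ElementTree as ET


def add_metadata(story, keywords, placement):
    """Attach a hypothetical <meta> block with a placement and keywords."""
    meta = ET.SubElement(story, "meta")
    ET.SubElement(meta, "placement").text = placement
    keyword_list = ET.SubElement(meta, "keywords")
    for word in keywords:
        ET.SubElement(keyword_list, "keyword").text = word
    return story


# Build a small story and record where it should end up on the site.
story = ET.Element("story")
ET.SubElement(story, "headline").text = "49ers Defeat Broncos"
add_metadata(story, ["Denver", "Broncos", "49ers", "NFL"], "sports/nfl")

xml_out = ET.tostring(story, encoding="unicode")
print(xml_out)
```

Keeping the handling information inside the document itself means any downstream publishing step can read it without consulting a separate system.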

Applied Semantics' Categorizer and Concept Tagger tools are used to automate some of the work copy editors have traditionally done in news organizations. Based on an ontology of categories and terminology, the tools scan news stories for key terms and generate headlines, summaries and keyword lists. Applied Semantics maintains an ontology of more than a million terms specifically geared to the news business, so organizations do not have to customize it, a company representative explained.

With a knowledge base of ''1.2 million terms categorized into half-a-million concepts,'' the tools use industry-standard and user-defined taxonomies to determine the hierarchical relationships between terms in a document and to make sense of them, explained Gil Elbaz, co-founder and CIO at Applied Semantics.

For example, he said, the ontology and taxonomy technology can distinguish between the ''java'' from Starbucks and the ''Java'' from Sun Microsystems by noting that one is related to caffeine and the other to computer programming.
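
The general idea of telling the two senses apart by their related terms can be sketched in a few lines of Python. This is a toy illustration only, not Applied Semantics' algorithm: the two-concept ontology, its term lists and the simple overlap score are all invented for the example.

```python
# Toy ontology: each concept lists terms assumed to be related to it.
# The concept names and term lists are invented for illustration.
ONTOLOGY = {
    "java (coffee)": {"coffee", "caffeine", "espresso", "starbucks", "cup"},
    "Java (programming language)": {"programming", "software", "sun", "code", "developer"},
}


def disambiguate(text, ontology=ONTOLOGY):
    """Pick the concept whose related terms overlap most with the text."""
    words = {w.strip(".,;:!?\"'()").lower() for w in text.split()}
    scores = {concept: len(words & related) for concept, related in ontology.items()}
    best = max(scores, key=scores.get)
    # Return None when nothing in the text supports any concept.
    return best if scores[best] > 0 else None


coffee_sense = disambiguate("She ordered a java at Starbucks for the caffeine.")
code_sense = disambiguate("The developer wrote the software in Java.")
print(coffee_sense)
print(code_sense)
```

A production system would presumably weigh terms and use the hierarchy between concepts rather than a flat overlap count, but the principle of using surrounding vocabulary as evidence is the same.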

As deployed at USAToday.com, ''[the tools] allowed us to leverage the ontology to appropriately categorize and add meta data to stories that otherwise would be hard and time-consuming for the editor to do,'' Joglekar said.

Using a pre-season Denver Broncos vs. San Francisco 49ers Monday Night Football game, for example, the XML-based Categorizer would generate a headline such as ''49ers Defeat Broncos'' and a list of keywords, including Denver, Broncos, San Francisco, 49ers, NFL, professional football, 2002 season, Monday Night Football, Brian Griese and Jeff Garcia.

This is all work the editors would have had to do themselves by reading the story and picking out the keywords.

''To come up with those [items], it's kind of the 90/10 rule,'' Joglekar said. ''We can have 90% of the stuff an editor would think of actually submitted by Applied Semantics and stored in the document. It was a big time savings from an editorial point of view.''

The system also provides a level of consistency in categorizing that human editors can't achieve, he noted. For example, one editor might list a story on Michael Jordan under ''Jordan,'' where it might be confused with the Middle Eastern country, while another editor might list it as ''MJ,'' where it might get mixed up with stories on Michael Jackson. Applied Semantics' Categorizer uses a consistent tag that would always associate the term with the basketball star, Joglekar said.

While it is still too early to determine how effective Applied Semantics will be in categorizing news stories, he said, the system could potentially automate the way content is delivered to a news Web site.

''If we feel comfortable that Applied Semantics can accurately categorize a story -- that part of the process where editorial says, 'This should go to the NFL Broncos page, this should go to the golf page etc.' -- it [would be] a big time savings,'' Joglekar said.

''Editorial may look at [the generated information] and confirm that Applied Semantics made the right choice,'' he continued. ''But that will get the story out quicker and save editorial time.''
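
The workflow Joglekar describes, in which the software proposes a destination and an editor confirms it, might be sketched as follows. Everything here is hypothetical: the category labels, the page paths, the `suggest_page` helper and especially the confidence score, which the article does not say the tools provide.

```python
from dataclasses import dataclass

# Hypothetical mapping from category labels to site sections.
CATEGORY_TO_PAGE = {
    "NFL/Broncos": "/sports/nfl/broncos",
    "Golf": "/sports/golf",
}


@dataclass
class RoutingSuggestion:
    page: str
    needs_review: bool


def suggest_page(category, confidence, threshold=0.9):
    """Suggest a destination page; flag low-confidence or unknown categories."""
    page = CATEGORY_TO_PAGE.get(category)
    if page is None:
        # Unknown category: leave the decision entirely to an editor.
        return RoutingSuggestion(page="", needs_review=True)
    return RoutingSuggestion(page=page, needs_review=confidence < threshold)


print(suggest_page("Golf", 0.95))
print(suggest_page("NFL/Broncos", 0.5))
```

In this sketch a high-confidence match goes straight to its page, while anything uncertain is held for an editor to confirm.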

About the Author

Rich Seeley is Web Editor for Campus Technology.