USAToday.com puts XML to work at the editor's desk
By Rich Seeley
December 1, 2002
The newspaper business has always been hectic. In the 1920s, a reporter would
rush out of a courtroom, grab a phone (the new technology of that era), call the
newsroom and scream: ''Get me rewrite!''
The reporter would then dictate the story to a rewrite man, who banged it out
on a manual typewriter. An editor armed with a lead pencil would mark up the
typewritten pages, and add instructions for the typesetter as to where the
paragraphs were. Then he added a headline and handed it to a copyboy who rushed
it to a linotype operator who set the story for the afternoon paper.
The daily news business is no less hectic in 2002. While editors at USA Today
still mark up stories, they now use the Extensible Markup Language (XML), the
electronic descendant of the pencil marks that date back to editors
in green eyeshades.
Editors for the USAToday.com Web site are also beginning to work with
software tools that scan stories and automatically compose headlines, write
summaries of the content and list 10 keywords for search engines.
The XML editorial system at USAToday.com, which rolled out this summer, began
with a commitment to implementing XML technology for editing and publishing news
stories, according to Chat Joglekar, information management architect for the
project. The development project, aimed at producing faster, more intelligent
news coverage, began a year-and-a-half ago, he added.
''Before that we didn't store content in XML,'' Joglekar explained. ''We
basically stored it in HTML, which made it hard to understand what a story
encompassed. In HTML, you were always guessing what the headline was, what the
byline was. In XML, we can tag it appropriately and always know exactly what the
headline is, who the author is, the dateline ... all those appropriate
things.''
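The difference Joglekar describes can be illustrated with a minimal sketch. It assumes a hypothetical story format: the element names `story`, `headline`, `byline` and `dateline` are invented for illustration and are not USAToday.com's actual markup. With explicit tags, each field can be read directly rather than guessed from presentation markup:

```python
import xml.etree.ElementTree as ET

# Hypothetical story markup; element names are assumptions for illustration.
STORY_XML = """
<story>
  <headline>49ers Defeat Broncos</headline>
  <byline>Staff Writer</byline>
  <dateline>SAN FRANCISCO</dateline>
  <body>
    <p>First paragraph of the story.</p>
  </body>
</story>
"""

def read_fields(xml_text):
    """Return the headline, byline and dateline of a tagged story."""
    root = ET.fromstring(xml_text.strip())
    return {name: root.findtext(name) for name in ("headline", "byline", "dateline")}

fields = read_fields(STORY_XML)
print(fields["headline"])
```

In an HTML-only store, the same lookup would have to rely on heuristics such as "the first bold line is probably the headline."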
Because the newspaper business and its online publishing outlets are a
relatively small vertical industry, Joglekar found very little off-the-shelf
software that he could deploy in the new XML-based system.
The applications the editors use to select news items, assign meta data and
publish the final stories on the Web site are all ''homegrown,'' he said. ''It's
pretty specific [to news editing], so we could not really leverage any other
tools that were out there.''
USAToday.com is a Windows 2000 shop running on Intel Pentium-based Dell
hardware. The developers used Microsoft BizTalk server and SQL databases to
capture and store the news stories that are fed in electronically by wire
services. Most of the original coding was done in Visual Basic 6, Joglekar explained. Development
began with the COM+ model, but as the Microsoft .NET platform and tools came on
the market early in the project phase, the programming team began using ASP.NET
and more recently C#, he said.
''The code was primarily VB 6,'' Joglekar said. ''Anything that we currently
rewrite, upgrade or update we're doing in .NET. If we touch a piece of the
system again, it's going to be written in C# or ASP.NET.''
Joglekar said he was able to buy two applications -- XMetal Editor from
SoftQuad, now part of Corel Corp., Ottawa; and XML Categorizer and Concept
Tagger ontology-based tools from Los Angeles-based Applied Semantics Inc. -- and
plug them into the system.
The editors at USAToday.com use XMetal to add XML tags to the wire stories
that come into the system from Reuters, the Associated Press and other news
services, Joglekar said. While there is an XML-based News Markup Language,
called NewsML, USAToday.com created its own document type definition (DTD) that is
''loosely based'' on it and incorporates document definitions specific to the way
editors for the Web site work with stories, he added.
''It's basically our own XML format,'' Joglekar said. ''It incorporates stuff we
use for the meta data we want to capture, including the type of handling we do
-- such as where it will end up on the site.''
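As a rough sketch of what such a house format might carry, the following builds a story document with a handling entry that records where the story should appear on the site. The `metadata` and `destination` elements and their attributes are hypothetical and are not taken from USAToday.com's DTD or from NewsML:

```python
import xml.etree.ElementTree as ET

def build_story(headline, body_text, section, page):
    """Wrap a wire story in a hypothetical house format with handling meta data."""
    story = ET.Element("story")
    meta = ET.SubElement(story, "metadata")
    # Where the story should end up on the site (assumed element and attribute names).
    ET.SubElement(meta, "destination", {"section": section, "page": page})
    ET.SubElement(story, "headline").text = headline
    ET.SubElement(story, "body").text = body_text
    return story

story = build_story("49ers Defeat Broncos", "Story text.", "sports", "nfl")
xml_text = ET.tostring(story, encoding="unicode")
print(xml_text)
```

Keeping the handling information inside the document means the publishing step can read the destination from the story itself.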
Applied Semantics' Categorizer and Concept Tagger tools are used to automate
some of the work copy editors have traditionally done in news organizations.
Based on an ontology of categories and terminology, the tools scan news stories
for key terms and generate headlines, summaries and keyword lists. Applied
Semantics maintains an ontology of more than a million terms specifically geared to the
news business, so organizations do not have to customize it, a company
representative explained.
With a knowledge base of ''1.2 million terms categorized into half-a-million
concepts,'' the tools use industry-standard and user-defined taxonomies to
determine the hierarchical relationships between terms in a document and to make
sense of them, explained Gil Elbaz, co-founder and CIO at Applied Semantics.
For example, he said, the ontology and taxonomy technology can distinguish
between the ''java'' from Starbucks and the ''Java'' from Sun Microsystems by noting
that one is related to caffeine and the other to computer programming.
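The idea behind the ''java'' example can be shown with a toy sketch. This is not Applied Semantics' algorithm; it simply assumes each sense of a word comes with a small set of related terms (the `SENSES` table is invented for illustration) and picks the sense that shares the most words with the text:

```python
# Toy sense inventory: each sense of "java" lists related terms (illustrative only).
SENSES = {
    "java (coffee)": {"caffeine", "coffee", "starbucks", "espresso", "cup"},
    "Java (programming language)": {"sun", "microsystems", "programming", "code", "software"},
}

def disambiguate(text, senses=SENSES):
    """Pick the sense whose related terms overlap most with the words in text."""
    words = {w.strip(".,;:!?'\"").lower() for w in text.split()}
    scores = {sense: len(words & related) for sense, related in senses.items()}
    best = max(scores, key=scores.get)
    # Return None when no related term appears at all.
    return best if scores[best] > 0 else None

print(disambiguate("Sun Microsystems released new Java programming tools."))
print(disambiguate("She ordered a cup of java for the caffeine."))
```

A production ontology would weigh far more evidence than simple word overlap, but the principle of using surrounding terms to choose a concept is the same.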
As deployed at USAToday.com, ''[the tools] allowed us to leverage the ontology
to appropriately categorize and add meta data to stories that otherwise would be
hard and time-consuming for the editor to do,'' Joglekar said.
Using a pre-season Denver Broncos vs. San Francisco 49ers Monday Night
Football game, for example, the XML-based Categorizer would generate a headline
such as ''49ers Defeat Broncos'' and a list of keywords, including Denver,
Broncos, San Francisco, 49ers, NFL, professional football, 2002 season, Monday
Night Football, Brian Griese and Jeff Garcia.
This is all work the editors would have had to do themselves by reading the
story and picking out the keywords.
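To make the keyword step concrete, here is a deliberately naive sketch that scans a story for terms from a known list and returns up to 10 of them. The `KNOWN_TERMS` list is a hypothetical stand-in for an ontology, and plain substring matching is an assumption chosen for brevity, not a description of how the Categorizer works:

```python
# Hypothetical term list standing in for a news ontology (illustrative only).
KNOWN_TERMS = [
    "Monday Night Football", "San Francisco", "Denver",
    "Broncos", "49ers", "NFL",
]

def extract_keywords(text, terms=KNOWN_TERMS, limit=10):
    """Return known terms found in text, ordered by first appearance."""
    lowered = text.lower()
    hits = []
    for term in terms:
        pos = lowered.find(term.lower())
        if pos != -1:
            hits.append((pos, term))
    return [term for _, term in sorted(hits)][:limit]

story = ("The San Francisco 49ers beat the Denver Broncos in a "
         "preseason Monday Night Football game.")
print(extract_keywords(story))
```

Note that this sketch only finds terms that literally appear in the text. Supplying a keyword such as NFL when the story never mentions it is where an ontology of related concepts would come in.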
''To come up with those [items], it's kind of the 90/10 rule,'' Joglekar said.
''We can have 90% of the stuff an editor would think of actually submitted by
Applied Semantics and stored in the document. It was a big time savings from an
editorial point of view.''
The system also provides a level of consistency in categorizing that human
editors can't achieve, he noted. For example, one editor might list a story on
Michael Jordan under ''Jordan,'' where it might be confused with the Middle
Eastern country, while another editor might list it as ''MJ,'' where it might get
mixed up with stories on Michael Jackson. Applied Semantics' Categorizer uses a
consistent tag that would always associate the term with the basketball star,
Joglekar said.
While it is still too early to determine how effective Applied Semantics will
be in categorizing news stories, he said, the system could potentially automate
the way content is delivered to a news Web site.
''If we feel comfortable that Applied Semantics can accurately categorize a
story -- that part of the process where editorial says, 'This should go to the
NFL Broncos page, this should go to the golf page etc.' -- it [would be] a big
time savings,'' Joglekar said.
''Editorial may look at [the generated information] and confirm that Applied
Semantics made the right choice,'' he continued. ''But that will get the story out
quicker and save editorial time.''
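The workflow Joglekar outlines, in which software proposes a destination and editorial confirms it, might be sketched as follows. The category labels, page paths and function names are all hypothetical; the article does not describe how USAToday.com would implement this step:

```python
# Hypothetical mapping from category labels to site pages (illustrative only).
PAGE_FOR_CATEGORY = {
    "NFL/Broncos": "/sports/nfl/broncos",
    "Golf": "/sports/golf",
}

def propose_route(category):
    """Suggest a destination page for a categorized story, pending editor review."""
    page = PAGE_FOR_CATEGORY.get(category)
    return {"category": category, "page": page, "confirmed": False}

def confirm(route, editor_approves):
    """Record the editor's decision; unmapped or rejected routes stay unconfirmed."""
    ok = editor_approves and route["page"] is not None
    return {**route, "confirmed": ok}

route = confirm(propose_route("NFL/Broncos"), editor_approves=True)
print(route["page"], route["confirmed"])
```

Keeping the confirmation as a separate step reflects the point in the quote: the editor reviews a suggestion rather than making the routing decision from scratch.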
About the Author
Rich Seeley is Web Editor for Campus Technology.