Major errors prompt questions over Google Book Search's scholarly value

Google Book Search's mistakes provoke questions about its scholarly value. Matthew Reisz reports

September 10, 2009

It should be the world's greatest scholarly resource, but some claim that Google Book Search's many huge - and often hilarious - errors raise major questions about its value to serious researchers.

Why does a link to a book on cosmology by a Napoleonic mathematician lead to a novel by Barbara Taylor Bradford? Could Sigmund Freud really be one of the authors of The Mosaic Navigator: The essential guide to the Internet Interface? And how did Barack Obama publish 29 books before he was born?

The journal Speculum is about the Middle Ages rather than gynaecological instruments, so why is it listed under "Health & Fitness"? And why on earth is a French translation of Hamlet classified under "Antiques & Collectibles"?

Even stranger, there seems to be something special about the year 1899, with Google claiming that a novel by Stephen King, a biography of Bob Dylan, a Portuguese version of the Beatles' film Yellow Submarine - and dozens of almost equally implausible titles - were all published then.

Such grotesque mistakes were pointed out by the linguist Geoffrey Nunberg, adjunct full professor at the University of California at Berkeley's School of Information, at its recent conference, "The Google Book Settlement and the Future of Information Access".

Mark Liberman, trustee professor of phonetics at the University of Pennsylvania, made a similar case. A self-proclaimed "enthusiast" for Google Books, he knew it would revolutionise his own discipline - the history of the English language - by hugely increasing the amount of textual material easily available for analysis, "with a potential effect comparable to the invention of the telescope or the microscope".

It remained crucial for scholars, however, that "basic bibliographic information - who wrote what, when - is almost always correct", he said. He added that he was sceptical about how soon the errors would be sorted out. Since such information "may not matter much to ordinary search customers, there is little incentive for Google to fix it", he said.

Professor Nunberg was even more outspoken in a blog post published on 29 August. With Google likely to become "the universal library for a long time to come", scholars need good metadata. Unfortunately, Google's information is "a train wreck: a mish-mash wrapped in a muddle wrapped in a mess".

The posting led to a long reply by Jon Orwant, who has the unenviable task of "managing the Google Books metadata team".

He cheerfully admits to some additional errors, such as an edition of Charles Dickens' A Christmas Carol dated to 1135 - three centuries before Johannes Gutenberg introduced the printing press to Europe.

He is also frank about the scale of the glitches still to be ironed out: "Geoff refers to us having hundreds of thousands of errors. I wish it were so. We have millions ... When you're dealing with a trillion metadata fields, one-in-a-million errors happen a million times over."

The glut of books "published" in 1899 is explained by a Brazilian metadata provider, which strangely uses that year as a default setting when it doesn't know the true date.

Nonetheless, Google is struggling to put things right. "Geoff's efforts will have singlehandedly improved nearly one million metadata records in our repository," Dr Orwant says.

Researchers will be keeping a close eye on whether Google manages to solve some pretty monumental teething problems.

matthew.reisz@tsleducation.com.
