John Harvard's Journal

Harvard's Googled Library

March-April 2005

In agreements announced December 14, the world's largest search engine, Google, undertook to build an on-line reading room to house digital, searchable versions of millions of books belonging to Harvard, Stanford, the University of Michigan, the New York Public Library, and Oxford University, and to make them freely available to anyone anywhere on the Internet. The collaboration has the potential to create what Harvard librarians call "a revolutionary new information-location tool" and "an important public good."

Harvard began with a pilot project: digitizing 40,000 of the five million books at the Harvard Depository in Southborough, Massachusetts. Google staff began to work at the depository in mid-January, using their own proprietary apparatus (which the company will neither describe nor allow to be photographed), and should finish by summer.

Harvard wants to see how the production process goes in the pilot project before agreeing to let Google go further, explains Sidney Verba, Pforzheimer University Professor and director of the Harvard University Library. Although he and his colleagues do not believe that this is risky business, they want to make certain that Google doesn't damage books, or lose them, or keep them out of circulation too long. For its part, Google needs a better sense of the price of wholesale digitization, says Verba. That newly public, cash-flush company will pay all the bills, apart from some support expenses borne by Harvard. Google isn't talking, but others speculate that digitization might cost $10 per book; with earlier technology, costs were far higher. Finally, the laws that apply to digitized books still under copyright are uncertain and changing, and even though many publishers have already entered into agreements with Google about how their books may be used, it is not certain, says Verba, that the grand plan won't be challenged.

Verba has seen Google staffers scan a book and judges the process kind, gentle, and efficient. "We are fairly well convinced the pilot project is going to work," he says. If so, then Google will get going on all 15 million Harvard books, a job that will likely take between five and 10 years.

When the pilot project is done, the first 40,000 books will be virtually browsable. At present, because they are stored off campus, their content is difficult to assess, even for Harvard users. Once they are scanned, anyone will be able to go to Google and read the full text of out-of-copyright materials. Probably no text of books under copyright will be displayed at first, but the library anticipates that if the full project goes forward, snippets, or paragraphs, or perhaps even a page or two of copyrighted books will be shown by Google -- enough text to enable a researcher to determine whether to go to a library to see the whole book or to buy a copy.

In the coming era, a student will go to Google to search for books on Ralph Waldo Emerson, let's say, and read the books or sample them on-line. If the student wants to know what Emerson had for breakfast, a keyword search will zip through the books and uncover any reference to his tastes. (Verba is not sure in what order books will be arranged in search results. Perhaps the most popular ones will be at the top of the list. Perhaps some yet-to-be-introduced software, informed by artificial intelligence, will direct a user more effectively to what that user, in particular, really wants.)

Verba says that eventually the Google page shown to users at Harvard will have a link to HOLLIS, the University's on-line library catalog, so that a researcher can easily see what library at Harvard has a wanted book and whether it is available.

A student could approach an Emerson research project in this way, by going first to Google, but Verba recommends starting with HOLLIS. That catalog will show all books about Emerson at Harvard and tell which have been Googleized, providing a link. Moreover, HOLLIS catalogs books too fragile to be scanned, as well as materials in numerous other media -- photographs, for instance -- and so would also reveal that Harvard has Emerson's papers.

Some students today may incline to the heresy that the Internet can safely be their sole source of research information: if you can't learn on the Internet that Emerson liked his eggs sunny side up, the information isn't worth having. The Googleization of Harvard's miles and miles of books, however, may usefully marry the Internet and the library in the mind of any child of the times with a tendency to feel they are divorced.