Google to scan 800,000 manuscripts, books from Indian university

Discussion in 'Main Lounge' started by Jason, May 23, 2007.

  1. Jason
    Need to dig up some information from a centuries-old text on ayurvedic medicine? Soon you'll be able to do so from the comfort of your living room. Google has agreed to index and digitize 800,000 texts stored at the University of Mysore in India as part of its attempt to broaden the Google Book Search program, according to the Indo-Asian News Service.

    "Written in both papers and palm leaves, there are around 100,000 manuscripts in our library, some dating back to the eighth century," said the vice chancellor of Mysore. "The effort is to restore and preserve this cultural heritage for effective dissemination of knowledge." He also added, cryptically, that the University plans to "patent them before making them available on public domain."

    Google has been aggressively expanding its Book Search program to include non-English library materials. It recently announced a deal with the University of Lausanne to scan a large collection of French-language works, and the new partnership with Mysore will digitize works in Sanskrit and Kannada. These schools lack the fear of Google displayed by the French government, which has so far introduced projects like Gallica and Quaero to challenge the search giant without any apparent success.

    India has become increasingly important to Google in the last few years. The company opened a billion-dollar data center in Andhra Pradesh, and it recently announced the availability of Google News in Hindi. But how will the might of Google's technology fare when confronted with handwritten Sanskrit?

    How steady is your hand?

    Making an archive like this useful to scholars will involve using optical character recognition to translate the handwritten texts into searchable characters—and it's a tough task. Our own Jon Stokes has done extensive research in this area and says, "The hard part about doing a project like this lies not so much in the actual digitization of the page images, but in doing OCR on a handwritten script. OCR can work quite well on handwritten manuscript pages, if the handwriting is regular enough. Researchers doing this stuff with Greek manuscripts have gotten some good results, but again only on regular hands."

    Google has been sponsoring open-source tools like OCRopus to address these problems. OCRopus is built on Tesseract, the open-source OCR engine that Google now helps maintain, and it adds a handwriting recognizer and "novel high-performance layout analysis methods." The research is clearly of more than academic interest to Google: as the company expands its digitization efforts, OCR is the only feasible way to convert handwriting into text at such a massive scale. But the problems go beyond character recognition itself; storage and markup of the data are also challenges.
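    To get a feel for the recognition step, here is a minimal sketch of what a Tesseract-based OCR pass over one scanned page might look like. It assumes the pytesseract Python wrapper and the Sanskrit and Kannada language packs ("san", "kan") are installed, none of which the article mentions, and handwritten palm-leaf scans would fare far worse than the printed pages this kind of pipeline is usually pointed at.

    # Minimal sketch: OCR one scanned page with Tesseract via pytesseract.
    # Assumes pytesseract plus "san" and "kan" traineddata are installed;
    # accuracy on handwritten manuscripts will be much lower than on print.
    from PIL import Image
    import pytesseract

    def ocr_page(image_path: str, langs: str = "san+kan") -> str:
        """Return recognized text for a single scanned page."""
        page = Image.open(image_path)
        # Grayscale conversion is cheap preprocessing that often helps on degraded scans.
        page = page.convert("L")
        return pytesseract.image_to_string(page, lang=langs)

    if __name__ == "__main__":
        print(ocr_page("manuscript_page_0001.tif"))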

    The Text Encoding Initiative (TEI) was founded in 1987 with the aim of providing SGML-compliant, machine-readable texts for humanities scholars and social scientists. The organization's "P3" text encoding guidelines have been in use since 1994 in a range of digital library and manuscript encoding projects, but marking up documents into a TEI-compliant format is a challenge.
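    For a rough idea of what that markup adds on top of raw OCR output, here is a small sketch that wraps one page of recognized text in a minimal TEI-like skeleton. The element names follow the guidelines' basic structure, but real manuscript encoding is far richer, so treat this as an illustration rather than a conforming document; the function name and sample strings are made up for the example.

    # Sketch only: wrap one page of OCR output in a minimal TEI-style XML skeleton.
    # Not a validating TEI document; real manuscript encoding carries much more detail.
    import xml.etree.ElementTree as ET

    def tei_skeleton(title: str, repository: str, page_text: str) -> str:
        """Wrap one page of OCR output in a minimal TEI-like XML document."""
        tei = ET.Element("TEI", xmlns="http://www.tei-c.org/ns/1.0")
        file_desc = ET.SubElement(ET.SubElement(tei, "teiHeader"), "fileDesc")
        ET.SubElement(ET.SubElement(file_desc, "titleStmt"), "title").text = title
        ET.SubElement(ET.SubElement(file_desc, "sourceDesc"), "p").text = repository
        body = ET.SubElement(ET.SubElement(tei, "text"), "body")
        ET.SubElement(body, "p").text = page_text
        return ET.tostring(tei, encoding="unicode")

    if __name__ == "__main__":
        print(tei_skeleton("Untitled palm-leaf manuscript",
                           "University of Mysore library",
                           "...recognized text for one page..."))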

    If Google is using the OCRopus…
     
    Jason, May 23, 2007
    #1