Text Digitization: Quality and Costs, Steve Chapman, Harvard University
In planning digital text conversion, managers should work through a checklist that includes:
Type or genre of digitized text
What types of digitized text do you need to produce: page images only, page images with hidden text for searching, page images with displayed full text, or displayed text alone? By convention, image-only delivery is common for holographic (handwritten) materials, music scores and some machine-printed material. Page images with "hidden" full text are common for black-and-white machine-printed text. Encoded text is common for scholarly text initiatives.
Delivery behaviors of digitized text
What do the electronic texts need to do? For example, do users need to view entire pages, read annotated transcriptions, print, browse and/or search? Every one of these behaviors must be encoded: machines must be able to interpret page sequence, page number and/or document sections. This is accomplished through the creation of structural metadata, which "gathers, sews and binds" page images into manageable and navigable digital objects. Numerous standards and formats have been created to support these behaviors, including Encoded Archival Description (EAD), Text Encoding Initiative (TEI), Portable Document Format (PDF) and Metadata Encoding and Transmission Standard (METS).
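The idea of structural metadata can be illustrated with a minimal Python sketch. It is not an implementation of METS, TEI or any other standard named above; the class names, the `bind_pages` and `locate` helpers, and the image file naming pattern are all assumptions made for illustration. It shows only how recording page sequence, printed page number and section membership lets software navigate a set of page images.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Page:
    sequence: int                         # physical order of the scanned image, 1-based
    image_file: str                       # hypothetical file name for the page image
    printed_number: Optional[str] = None  # number as printed on the page, if any


@dataclass
class Section:
    label: str
    pages: List[Page] = field(default_factory=list)


def bind_pages(section_specs: List[Tuple[str, List[Optional[str]]]]) -> List[Section]:
    """Assign a running sequence number to every page, section by section."""
    sections = []
    sequence = 0
    for label, printed_numbers in section_specs:
        section = Section(label=label)
        for printed in printed_numbers:
            sequence += 1
            # The eight-digit .tif naming pattern is an assumption for this sketch.
            section.pages.append(Page(sequence, f"{sequence:08d}.tif", printed))
        sections.append(section)
    return sections


def locate(sections: List[Section], printed_number: str) -> Optional[Tuple[str, int]]:
    """Return (section label, sequence) for the first page with this printed number."""
    for section in sections:
        for page in section.pages:
            if page.printed_number == printed_number:
                return section.label, page.sequence
    return None


book = bind_pages([
    ("Front matter", ["i", "ii", None]),  # None marks an unnumbered page
    ("Chapter 1", ["1", "2", "3"]),
])
print(locate(book, "2"))  # ('Chapter 1', 5)
```

Keeping the physical sequence separate from the printed page number is what allows a viewer to jump to "page 2" even though it is the fifth image in the object.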
What level of quality is required for the digitized text?
Image quality metrics include:
What are the planned outcomes for the original objects?
Will the original objects be disbound before scanning? Will they be cleaned, repaired or conserved? Will the original materials be rehoused after scanning? Will film surrogates be made of the original objects? All of these considerations affect the type of scanner needed for digitization. The least expensive scanners (single-page, auto-feed) accommodate disbound, non-fragile, black-and-white items, whereas fragile, color or filmed items require more complex and expensive equipment. Scanning costs per page may range from $0.15 to $6.25, depending on the requirements.
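To show how wide the resulting budget range can be, the per-page figures above can be applied to a project of a given size. The following is a rough sketch only: the function name and the 10,000-page project are hypothetical, and it assumes that cost scales linearly with page count, which the figures above do not guarantee.

```python
from decimal import Decimal

# Per-page scanning cost bounds quoted in the text (US dollars).
LOW_PER_PAGE = Decimal("0.15")
HIGH_PER_PAGE = Decimal("6.25")


def scanning_cost_range(page_count: int):
    """Return (lowest, highest) total scanning cost for a page count.

    Assumes cost scales linearly with the number of pages.
    """
    if page_count < 0:
        raise ValueError("page_count must not be negative")
    return page_count * LOW_PER_PAGE, page_count * HIGH_PER_PAGE


# Hypothetical project of 10,000 pages.
low, high = scanning_cost_range(10_000)
print(f"${low:,.2f} to ${high:,.2f}")  # $1,500.00 to $62,500.00
```

The roughly forty-fold spread between the two totals is why the condition and planned handling of the originals need to be settled before a budget is drawn up.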