From Investigation to Implementation

Building a Program for the Large-Scale Digitization of Manuscripts

Digitization

This section describes the workflows and guidelines developed for the creation of the image files comprising the Thomas E. Watson Papers Digital Collection.

Image Creation

Image creation for the bulk of the materials in the Thomas E. Watson Papers was conducted using a Zeutschel OS 12000 Bookcopy scanner. One half-time (20 hours per week) graduate student assistant created the digital images between January 2008 and March 2009.

Statistics

Statistics were maintained for the digitization of the materials from the Correspondence Series, which was conducted between 7 January 2008 and 3 June 2008. Total processing time per scan, which includes technical metadata capture, was ~1 minute 32 seconds.* Materials were scanned as TIFF files in color at 400 ppi.**

| Total Items | Total Linear Feet | Total Scans | Total Time | Time Per Linear Foot | Time Per Scan |
| --- | --- | --- | --- | --- | --- |
| 8,434 | 7.5 | 15,551 | ~400 hrs | ~5 hrs 20 min | 1 min 32 sec |

The other series in the collection, as well as materials on loan from the Watson-Brown Foundation, were scanned from July 2008 to March 2009. Scan time per image for these materials was considerably shorter, taking an average of 1 minute 8 seconds for image and technical metadata capture. More than 12,300 items were scanned, resulting in over 45,500 image files.
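
The reported scan times can be cross-checked with a little arithmetic. The following is a minimal sketch using only the figures stated above; the function and constant names are illustrative and not part of any SHC tooling.

```python
# Figures reported in the Statistics section.
CORRESPONDENCE_SCANS = 15_551
SECONDS_PER_SCAN_CORRESPONDENCE = 1 * 60 + 32  # 1 min 32 sec
SECONDS_PER_SCAN_LATER_SERIES = 1 * 60 + 8     # 1 min 8 sec


def total_hours(scans: int, seconds_per_scan: int) -> float:
    """Total scanning time in hours for a given number of scans."""
    return scans * seconds_per_scan / 3600


def percent_reduction(before: float, after: float) -> float:
    """Percentage by which `after` is smaller than `before`."""
    return (before - after) / before * 100


if __name__ == "__main__":
    hours = total_hours(CORRESPONDENCE_SCANS, SECONDS_PER_SCAN_CORRESPONDENCE)
    print(f"Correspondence Series: about {hours:.0f} hours")

    drop = percent_reduction(SECONDS_PER_SCAN_CORRESPONDENCE,
                             SECONDS_PER_SCAN_LATER_SERIES)
    print(f"Later series: about {drop:.0f}% less time per scan")
```

Run as is, the sketch gives roughly 397 hours for the Correspondence Series, in line with the reported total of about 400 hours, and a per-scan time about 26% shorter for the later series.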

* Scan time statistics do not differentiate between image processing time and technical metadata capture time. The current average time on the Zeutschel includes the time necessary for mounting the item on the scanner, positioning the four sides of the crop boxes on screen (auto-crop was found to be unsatisfactory), scanning, naming the file, capturing technical metadata, and moving the file to permanent storage.

** When imaging of the collection commenced, master image files were captured and stored as TIFFs. Since then, the SHC has decided that JPEG2000 is an acceptable file format for long-term preservation, so images for future large-scale digitization projects will be scanned and stored as JPEG2000 files.

File Naming

File names are created using a meaningful pattern of alphanumeric characters separated by underscores.

The first part of the filename consists of the call number of the collection to which the materials belong. This is the same number as the <eadid> in the collection's EAD-encoded finding aid. All SHC collections have a call number of five digits or fewer, so leading zeros are used to pad any number that has fewer than five digits.

The second part of the filename indicates the type and number of the container in which the original object resides. The syntax follows the container labeling and numbering guidelines practiced by SHC technical services. When the container is a regular folder, only the container number is listed, with leading zeros: 0001, 0002, etc. When the container is of any other type, the prefix dictated by SHC EAD guidelines is placed before the container number, separated by an underscore: PF_0001, PF_0002, etc. (for a photograph folder).

The third part of the filename indicates the position of the scanned item in its container, with leading zeros: 0001, 0002, etc.

Example File Names
00755_0007_0003
This would be the root filename for the third image in folder 7 of collection number 755.
00755_PF_0012_0001
This would be the root filename for the first image in photograph folder 12 of collection number 755.
00755_SV_0003_0033
This would be the root filename for the 33rd page of oversize volume 3 of collection number 755.
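
The naming pattern described above can be sketched in a few lines of Python. This is an illustration only, not the SHC's own tooling: the function names are hypothetical, and the sketch assumes that container numbers and item positions are always padded to four digits, as in the examples.

```python
from typing import Optional


def root_filename(collection: int, container: int, position: int,
                  container_prefix: Optional[str] = None) -> str:
    """Build a root filename: call number, container, then item position."""
    if not 0 < collection <= 99999:
        raise ValueError("call number must have five digits or fewer")
    parts = [f"{collection:05d}"]        # call number, padded to five digits
    if container_prefix:                 # e.g. "PF"; omitted for regular folders
        parts.append(container_prefix)
    parts.append(f"{container:04d}")     # container number
    parts.append(f"{position:04d}")      # position of the item in its container
    return "_".join(parts)


def parse_root_filename(name: str) -> dict:
    """Split a root filename back into its numeric parts and optional prefix."""
    parts = name.split("_")
    if len(parts) == 3:
        collection, container, position = parts
        prefix = None
    elif len(parts) == 4:
        collection, prefix, container, position = parts
    else:
        raise ValueError(f"unexpected filename pattern: {name!r}")
    return {
        "collection": int(collection),
        "container_prefix": prefix,
        "container": int(container),
        "position": int(position),
    }


if __name__ == "__main__":
    print(root_filename(755, 7, 3))          # regular folder
    print(root_filename(755, 12, 1, "PF"))   # photograph folder
    print(parse_root_filename("00755_SV_0003_0033"))
```

Called with the values from the examples above, `root_filename` reproduces the three example names.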

File Storage

Three iterations of each image were created and stored for the Thomas E. Watson Papers Digitization Project. Materials were initially scanned as TIFF images. These files were then converted to lossy JPEG2000 images for Web delivery at a compression rate of 0.06%. JPEG thumbnails were also created from the TIFF files. Batch conversions of the files were conducted using the ImageMagick image manipulation software suite.
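
A batch conversion of this kind might look something like the sketch below, which drives ImageMagick from Python. It is not the script used for the project: the quality value and thumbnail size are hypothetical placeholders (the settings actually used are not given here), the directory layout is assumed, and the sketch assumes that ImageMagick's `convert` command is on the PATH and was built with JPEG2000 support.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List

# Hypothetical settings; the options actually used for the project are not documented here.
JP2_QUALITY = "40"
THUMBNAIL_GEOMETRY = "150x150"


def derivative_commands(tiff: Path, out_dir: Path) -> List[List[str]]:
    """Return the two ImageMagick commands for one master TIFF."""
    stem = tiff.stem
    return [
        # Lossy JPEG2000 copy for Web delivery.
        ["convert", str(tiff), "-quality", JP2_QUALITY,
         str(out_dir / f"{stem}.jp2")],
        # Small JPEG thumbnail.
        ["convert", str(tiff), "-thumbnail", THUMBNAIL_GEOMETRY,
         str(out_dir / f"{stem}.jpg")],
    ]


def convert_batch(master_dir: Path, out_dir: Path) -> int:
    """Create both derivatives for every .tif in master_dir; return the TIFF count."""
    if shutil.which("convert") is None:
        raise RuntimeError("ImageMagick 'convert' was not found on PATH")
    out_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    for tiff in sorted(master_dir.glob("*.tif")):
        for command in derivative_commands(tiff, out_dir):
            subprocess.run(command, check=True)  # stop on the first failure
        count += 1
    return count
```

Building the command lists separately from running them makes it possible to inspect or log each command before anything is written to disk.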

The volume of images created by scanning an entire manuscript collection requires a large amount of digital storage space for both the master images and the Web delivery images. One file of each type was created per scan, yielding more than 45,000 files of each type and more than 136,000 image files in total.

| File Type | Total Files | Average File Size | Total Storage Requirement |
| --- | --- | --- | --- |
| TIFF | 45,387 | 30.84 MB | 1.35 TB |
| JPEG2000 | 45,387 | 2.59 MB | 115 GB |
| JPEG | 45,387 | 2.4 KB | 80 MB |

More than 2 TB of storage is used to house the image files included in the Thomas E. Watson Papers Digital Collection.

File Storage for Future Projects

For future projects, the SHC has decided to scan materials initially as lossless JPEG2000 images and to retain them as the master files. This will dramatically decrease the amount of long-term storage needed to preserve the master images. Even with the smaller JPEG2000 files, digitization of huge quantities of manuscript materials will require a monumental amount of storage capacity. The SHC staff anticipates that the average file size will be 12 MB and that the Digital SHC will require more than a terabyte of additional storage space annually in the Digital Archive (server space designated for long-term storage of electronic files), as well as a second terabyte annually on the Web server (server space for the online, publicly available collections).
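
To give a rough sense of scale, the sketch below compares the anticipated 12 MB master with the 30.84 MB average TIFF master reported for the Watson Papers, and estimates how many 12 MB images fit in one terabyte. The function names are illustrative, and the use of binary units (1 TB = 1,048,576 MB) is an assumption; the image count is a back-of-the-envelope figure, not a throughput stated by the SHC.

```python
TIFF_MASTER_MB = 30.84   # average TIFF master size reported for the Watson Papers
JP2_MASTER_MB = 12.0     # average size anticipated for lossless JPEG2000 masters
MB_PER_TB = 1024 * 1024  # assumption: binary units


def percent_saved(old_mb: float, new_mb: float) -> float:
    """Percentage of storage saved per image by the smaller format."""
    return (old_mb - new_mb) / old_mb * 100


def images_per_terabyte(average_mb: float) -> int:
    """Whole number of images of the given average size that fit in one terabyte."""
    return int(MB_PER_TB // average_mb)


if __name__ == "__main__":
    saved = percent_saved(TIFF_MASTER_MB, JP2_MASTER_MB)
    print(f"Storage saved per master image: about {saved:.0f}%")
    print(f"12 MB images per terabyte: {images_per_terabyte(JP2_MASTER_MB):,}")
```

Under these assumptions, a 12 MB master is about 61% smaller than the average TIFF master, and one terabyte holds roughly 87,000 such images.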