Born Digital Preservation Metadata Documentation

From ArchProc

Introduction

During the Spring 2008 semester, Jennifer Joyner and Joyce Chapman developed a plan to better manage preservation metadata for oral histories at the Southern Historical Collection (SHC). This project, which was done in conjunction with the Carolina Digital Library and Archives (CDLA), includes the following:

  • Born Digital Preservation Metadata Spreadsheet [1]
  • Born Digital Preservation Metadata Data Dictionary to accompany spreadsheet
  • MODS Record and Mapping
  • METS Record and Mapping
  • MADS Record and Mapping

This documentation will explain the thought processes behind each of the components listed above. At the conclusion of this documentation is a list of resources used during this project.

Born Digital Preservation Metadata Spreadsheet

One of the most important aims of this project was to develop a solid plan for tracking preservation metadata for born digital audio associated with the Southern Oral History Program (SOHP) collection at the SHC. With the guidance of Steve Weiss, we determined that it was necessary to track the following for born digital audio [2]:

  • interview call number (Call number)
  • sample rate of the audio (Sample rate)
  • audio sample size (Sample size)
  • number of channels (Channels)
  • duration of the audio (Duration)
  • type of recording device used (Recording device)
  • whether an access copy is available (Access copy) and, if so, its file type (Access file type)
  • whether the access file has been altered in any way (Access file processed)
  • checksum (Checksum)
  • original format of the audio (Original format)
  • whether the interview was conducted by the SOHP (SOHP/non)

In addition, the spreadsheet lists whether the following are available: digital transcript (e-Transcript), digital abstract (e-Abstract), digital field notes (e-FieldNotes), digital tape log (e-TapeLog), digital deposit sheet (e-Deposit Sheet), digital photos (e-Photos), digital release form (e-Release), digital life history form (e-Life History), and digital supplementary materials (e-SuppMat).

Metadata for an audio file can be found by viewing the file's properties. In addition, audio software such as WaveLab and Sound Forge provides this data.[3]
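As a rough illustration of where these values come from, the following Python sketch reads the header of an uncompressed WAV file with the standard-library wave module. It is not part of the project's workflow, which relied on file properties and audio software; the function name and dictionary keys are hypothetical and were chosen only to echo the spreadsheet fields.

```python
import wave


def read_wav_properties(path):
    """Return basic technical metadata read from a PCM WAV file header."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "sample_rate": rate,                    # samples per second (Hz)
            "sample_size": wav.getsampwidth() * 8,  # bits per sample
            "channels": wav.getnchannels(),
            "duration": frames / rate,              # length in seconds
        }
```

The wave module handles only uncompressed WAV audio, so other formats would need a different tool.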

Information regarding the existence of certain types of digital files can be determined by viewing the files available in the dark archive. The paths for the dark archive and all digital materials are available in the data dictionary.

Currently, we are in the process of calculating checksums for all of the born digital audio. Checksums are calculated using MD5 Hasher. Instructions for using this software are located on the wiki.
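For readers without access to MD5 Hasher, an equivalent MD5 checksum can be computed with Python's standard hashlib module. This is a minimal sketch, not the tool used in the project; the function name and the chunk size are assumptions.

```python
import hashlib


def md5_checksum(path, chunk_size=1024 * 1024):
    """Return the hexadecimal MD5 digest of the file at path."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so large audio files are never held fully in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned string can be compared with the value recorded in the Checksum field to confirm that a file has not changed.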

Born Digital Preservation Metadata Data Dictionary

The Born Digital Preservation Metadata Data Dictionary correlates with the Born Digital Preservation Metadata Spreadsheet. It explains each of the fields listed on the spreadsheet and lists instructions on how to identify and track the information listed in each of the fields. For more detail, please see the Born Digital Preservation Metadata Data Dictionary.

MODS Record and Mapping

Each interview in the Southern Oral History Program (SOHP) collection will have a corresponding Metadata Object Description Schema (MODS) record. Developed by the Library of Congress’ Network Development and MARC Standards Office, MODS is an XML schema designed for bibliographic data.[4] The Library of Congress designed MODS for library use. Its potential uses include the following:

  • As an SRU specified format
  • As an extension schema to METS (Metadata Encoding and Transmission Standard)
  • To represent metadata for harvesting
  • For original resource description in XML syntax
  • For representing a simplified MARC record in XML
  • For metadata in XML that may be packaged with an electronic resource[5]

We decided to use MODS for this particular project because of its element set, compatibility with METS, and helpful documentation. The sample MODS record was created using schema version 3.3; it will be necessary to revise the accompanying MODS record should an updated schema become available.

The sample MODS record is separated into five sections. The first, bibliographic information regarding the interview, includes the following: <identifier>, <titleInfo>, <originInfo>, <language>, <name> (of interviewer), <name> (of interviewee), <abstract>, <subject authority="docsouth">, <subject authority="lcsh"> and <subject authority="sohp">. The second section, which is included as a related item, describes the transcript. It includes the following within <relatedItem>: <typeOfResource>, <location>, <physicalDescription> and <accessCondition>. The third section describes the analog audio. It includes the following within <relatedItem>: <typeOfResource>, <location>, <physicalDescription> and <accessCondition>. The fourth and fifth sections, also within <relatedItem>, list the existence of a life history form or field notes for that interview.[6]
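The layout described above can be sketched in Python with the standard xml.etree.ElementTree module. This is an illustrative skeleton only, not the project's sample record: it covers the first three sections, leaves most elements empty, has not been validated against the MODS 3.3 schema, and the helper names, the type="local" identifier, and the use of <role> to distinguish interviewer from interviewee are all assumptions.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)


def el(parent, tag, text=None, **attrs):
    """Append a MODS-namespaced child element and return it."""
    child = ET.SubElement(parent, f"{{{MODS_NS}}}{tag}", attrs)
    if text is not None:
        child.text = text
    return child


def build_mods(call_number, title, interviewer, interviewee):
    """Build a skeleton MODS record for one interview (hypothetical layout)."""
    mods = ET.Element(f"{{{MODS_NS}}}mods", {"version": "3.3"})
    # Section 1: bibliographic information about the interview.
    el(mods, "identifier", call_number, type="local")
    el(el(mods, "titleInfo"), "title", title)
    for person, role in ((interviewer, "interviewer"), (interviewee, "interviewee")):
        name = el(mods, "name", type="personal")
        el(name, "namePart", person)
        el(el(name, "role"), "roleTerm", role, type="text")
    # Sections 2 and 3: the transcript and the analog audio as related items.
    for resource in ("text", "sound recording"):
        item = el(mods, "relatedItem")
        el(item, "typeOfResource", resource)
        el(item, "location")
        el(item, "physicalDescription")
        el(item, "accessCondition")
    return mods
```

Calling ET.tostring(build_mods(...), encoding="unicode") serializes the result for inspection.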

It is important to note that not all interviews in the SOHP collection have digital components.[7] Because of this, it is not possible to create METS records for every interview. Once a digital component for an interview becomes available, a METS record will be created. All METS records will point to the corresponding MODS record for that particular interview.

METS Record and Mapping

As noted above, METS records will be created for interviews when a digital component becomes available. The first METS records will be created by the CDLA for the 500 interviews digitized for the “Oral Histories of the American South” collection at Documenting the American South. Once the CDLA determines the specifications for the METS records, this documentation will be amended so that the sample MODS and METS records included in this project comply with their best practices guidelines.[8]

The Library of Congress defines METS as a “standard for encoding descriptive, administrative, and structural metadata regarding objects within a digital library.”[9] An initiative of the Digital Library Federation, it is being maintained by the Library of Congress Network Development and MARC Standards Office.

The METS schema has four major sections: descriptive metadata, administrative metadata, file groups, and structural map. Each section is described below.

Descriptive Metadata

The Descriptive Metadata section <dmdSec> includes descriptive metadata regarding the digital object. The descriptive metadata can be expressed using a current standard, such as MODS, MARC, Dublin Core, TEI Header, EAD, VRA, or FGDC. The <dmdSec> can have both internal and external descriptive metadata. For this project, the <dmdSec> points to external metadata. The METS record points to the corresponding MODS record using a URI.
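In METS, a pointer to external descriptive metadata is expressed with an <mdRef> element inside <dmdSec>. The following Python sketch shows one way such a section could be generated. The function name, the section ID, and the example.org address are hypothetical, and the CDLA's eventual specifications may call for different attribute values.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def build_dmdsec(mets_root, mods_uri, section_id="DMD1"):
    """Add a <dmdSec> that points to an external MODS record by URI."""
    dmd = ET.SubElement(mets_root, f"{{{METS_NS}}}dmdSec", {"ID": section_id})
    ET.SubElement(dmd, f"{{{METS_NS}}}mdRef", {
        "LOCTYPE": "URL",                  # how the location is expressed
        "MDTYPE": "MODS",                  # format of the referenced metadata
        f"{{{XLINK_NS}}}href": mods_uri,   # placeholder address of the MODS record
    })
    return dmd


mets = ET.Element(f"{{{METS_NS}}}mets")
build_dmdsec(mets, "http://example.org/mods/interview-0001.xml")
print(ET.tostring(mets, encoding="unicode"))
```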

Administrative Metadata

The Administrative Metadata section <amdSec> pertains to the files comprising a digital library object and to the original source material used to create the object. There are four main parts to the <amdSec>: technical metadata <techMD>, intellectual property rights <rightsMD>, source metadata <sourceMD>, and digital provenance metadata <digiprovMD>. A METS record can include multiple instances of the <amdSec> element and all sub-sections. Administrative metadata can be expressed using current standards, including those endorsed by the METS Editorial Board, such as MIX, NISO, and TextMD. We made the following decisions regarding the four parts of the <amdSec>:

Technical Metadata: For the <techMD> section, we chose to encode the technical metadata using AudioMD. AudioMD is currently under review by the Library of Congress.

Rights Metadata: We have chosen to store the rights statement in a separate file and link to it from the METS record. The link will be added once the rights statement is finished.

Source Metadata: The source metadata section includes descriptive and administrative metadata regarding the analog source from which a digital library object derives. We used two <sourceMD> sections in our METS record. The first section, which describes the source audio, is encoded using MODS. The second section, which is also encoded using MODS, describes the analog transcript. Our source metadata sections describe the digital audio and transcript, as the analog versions are described in the descriptive metadata section (MODS). [10]

Digital Provenance Metadata: We chose not to include digital provenance metadata in our METS record.
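The four decisions above can be pictured as the following <amdSec> skeleton, built here in Python with xml.etree.ElementTree. It is a sketch under several assumptions: the IDs and helper names are invented, the rights URI is a placeholder because the rights statement is not yet finished, the wrapped AudioMD and MODS content is left empty, and labelling AudioMD as MDTYPE="OTHER" with OTHERMDTYPE="AudioMD" is a guess at one workable encoding rather than a confirmed project choice.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def mets_el(parent, tag, **attrs):
    """Append a METS-namespaced child element and return it."""
    return ET.SubElement(parent, f"{{{METS_NS}}}{tag}", attrs)


def build_amdsec(mets_root, rights_uri):
    """Add an <amdSec> with techMD, rightsMD, and two sourceMD sections."""
    amd = mets_el(mets_root, "amdSec", ID="AMD1")
    # Technical metadata: AudioMD would be wrapped inline here.
    tech = mets_el(amd, "techMD", ID="TECH1")
    mets_el(tech, "mdWrap", MDTYPE="OTHER", OTHERMDTYPE="AudioMD")
    # Rights metadata: a link to the separately stored rights statement.
    rights = mets_el(amd, "rightsMD", ID="RIGHTS1")
    ref = mets_el(rights, "mdRef", LOCTYPE="URL", MDTYPE="OTHER")
    ref.set(f"{{{XLINK_NS}}}href", rights_uri)
    # Source metadata: two sections, each encoded using MODS.
    for n in (1, 2):
        source = mets_el(amd, "sourceMD", ID=f"SRC{n}")
        mets_el(source, "mdWrap", MDTYPE="MODS")
    # Digital provenance metadata (<digiprovMD>) is deliberately omitted.
    return amd
```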

File Groups

The file group section <fileGrp> lists together all related files. It groups all of the files that comprise a single electronic version of the digital library object. For this project, we had the following four types of digital files: digital preservation master, digital access copy, XML transcript, and HTML transcript. The <fileGrp> is still in progress and will be completed once the CDLA issues its requirements for METS documents.

Structural Map

The structural map <structMap> organizes the content available in the <fileSec>. It includes an attribute that allows the creator to specify what type of division is represented (physical or intellectual). The <structMap> is still in progress and will be completed once the CDLA issues its requirements for METS documents.

MADS Record and Mapping

For many of the interviews available at the SHC, contextual information is collected about the interviewees. This information includes name, birth date, gender, ethnicity, and occupation. The SHC determined that it would be helpful to have this data in a metadata schema. After considering both Encoded Archival Context (EAC) and Metadata Authority Description Schema (MADS), we chose to use MADS because it is a Library of Congress standard.

The Library of Congress’ Network Development and MARC Standards Office developed MADS to serve as a schema for metadata regarding agents, events, and terms.[11] MADS is currently in version 1.0, and the sample record for this project was developed using this version.

When developing the MADS record, we decided to dissect the name into its parts: family and given. We include other forms of the interviewee’s name within <variant type="other">. The contextual information about the interview is included using <note> with type="history". According to the MADS sample record provided by the Library of Congress, it is also possible to include the name in one line with last name first. We decided against using this format for the name.
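A minimal Python sketch of this layout follows, again using xml.etree.ElementTree. It is illustrative rather than a copy of the project's sample record: the function and helper names are hypothetical, any names passed in are placeholders, and the output has not been validated against the MADS 1.0 schema.

```python
import xml.etree.ElementTree as ET

MADS_NS = "http://www.loc.gov/mads/"
ET.register_namespace("mads", MADS_NS)


def mads_el(parent, tag, text=None, **attrs):
    """Append a MADS-namespaced child element and return it."""
    child = ET.SubElement(parent, f"{{{MADS_NS}}}{tag}", attrs)
    if text is not None:
        child.text = text
    return child


def build_mads(family, given, other_names=(), history=None):
    """Build a skeleton MADS record with the name split into its parts."""
    mads = ET.Element(f"{{{MADS_NS}}}mads", {"version": "1.0"})
    # Authoritative form: family and given names as separate <namePart>s.
    name = mads_el(mads_el(mads, "authority"), "name", type="personal")
    mads_el(name, "namePart", family, type="family")
    mads_el(name, "namePart", given, type="given")
    # Other forms of the name go in <variant type="other">.
    for other in other_names:
        variant = mads_el(mads, "variant", type="other")
        mads_el(mads_el(variant, "name", type="personal"), "namePart", other)
    # Contextual information goes in a history note.
    if history:
        mads_el(mads, "note", history, type="history")
    return mads
```

For example, build_mads("Doe", "Jane", other_names=["J. Doe"], history="Placeholder note.") yields one authority name, one variant, and one note.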

It is important to note that the SOHP database currently lists the interviewee’s name in one field rather than in two. Because of this, mapping from the SOHP database will not be possible. The same is true for the interviewers’ names.

Before creating MADS records for each of the interviewees, it will be necessary to research the necessity of name authority for MADS records. Name authority work has not been done for the majority of the interviewee names present in the oral histories; however, name authority work is currently underway for the 500 interviews digitized for the Oral Histories of the American South collection. If MADS requires that name authority records exist for all names in MADS, then the decision to use MADS will need to be reconsidered.


California Digital Library. “Inside CDL: CDL Guidelines for Digital Objects.” 26 September 2007.

“Encoded Archival Context.” 29 November 2004.

Indiana University Digital Library Program. “Sound Directions: Digital Preservation and Access for Global Audio Heritage.” 15 April 2008.

Library of Congress. “AudioMD Extension Schema Data Dictionary.” 25 February 2003.

Library of Congress. “MADS: Metadata Authority Description Schema Official Web Site.” 2 February 2007.

Library of Congress. “METS: Metadata Encoding & Transmission Standard Official Web Site.” 28 April 2008.

Library of Congress. “METS Overview and Tutorial.” 14 June 2001.

Library of Congress. “MODS: Metadata Object Description Schema Official Web Site.” 24 January 2008.

METS Editorial Board. METS: Primer and Reference Manual. September 2007. Available online:

Yearl, Stephen. “EAC: Encoded Archival Context.” 3 March 2003.


[1]Spreadsheet still in production.

[2]Note: the field that corresponds to each of the categories is listed in parentheses.

[3]When first beginning this project, we considered using Indiana University’s Sound Directions Audio Technical Metadata Collector (ATMC) software. This tool was appealing because it created an XML document that included much of the technical metadata we wished to collect. In addition, it created an MD5 checksum for each audio file. We found, however, that this tool only worked with WAV files. In addition, the software was still in development when we started this project. Thus, we decided against using this tool. For more information regarding Sound Directions and our experiences with the ATMC software, please see the wiki.

[4] Library of Congress, “MODS: Uses and Features.” 8 January 2008.

[5]Bulleted list taken from the “MODS: Uses and Features” page of the MODS official website. For more information, see:

[6] For more information on these fields, please see the MODS mapping included with this documentation.

[7] There has been some discussion regarding what counts as a “digital component.” If an interview has an electronic release form, will a METS record be created for that interview? If it is decided that interviews are put into METS only after they have digital audio or an electronic transcript, it will be necessary to add additional fields to the MODS records to represent the following: digital abstracts, digital field notes, digital tape logs, digital photos, digital life history forms, and digital supplementary materials. Currently, we only track the e-release form and e-deposit sheet in the MODS record.

[8]We created two METS mappings for this project. The first pulls from the DocSouth database. The second pulls from the SOHP database and spreadsheet.

[9]Library of Congress, “METS Metadata Encoding & Transmission Standard Official Web Site.” 28 April 2008.

[10]I am unsure if we are using the Source Metadata section correctly. It is possible that I will change this section soon.

[11]Library of Congress, “MADS: Metadata Authority Description Schema Official Web Site.” 2 February 2007.
