XML at UNC: Content Tagging
3. Content Tagging Revisited
Content tagging means adding tags that code text for indexing and retrieval. We don't currently use any indexing software that takes advantage of these tags, but, as part of NCEAD, we eventually will.
The most commonly used content tags are:
<persname> <geogname> <corpname> <famname> <subject> <genreform>
<title> is also a content tag, but it has a dual identity: it is also used to format titles (in italics or within quotation marks). Therefore, although it is technically a content tag, <title> can be used wherever its formatting function is needed.
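As a rough sketch of the formatting use of <title>, the fragment below relies on the render attribute defined in the EAD 2002 tag library. The titles are invented for illustration, and how italic and doublequote actually display depends on the stylesheet in use, which is an assumption here.

```xml
<!-- Illustrative fragment only; both titles are invented -->
<scopecontent>
  <p>Included are drafts of the novel
    <title render="italic">A Hypothetical Novel</title> and of the short story
    <title render="doublequote">An Invented Story</title>.</p>
</scopecontent>
```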
In preparing its EAD 2002 guidelines, NCEAD discussed content tagging thoroughly and came to the conclusion that extreme tagging (which we've often done) is not a good idea because:
- it takes too much time;
- it can affect indexing in nasty ways.
The first reason speaks for itself. The second has to do with how indexing software weighs significance. You don't necessarily want a collection to rise to the top of a hit list because of content tag overuse. For example, a small collection with important information about Chapel Hill history might have five occurrences of <geogname>Chapel Hill, N.C.</geogname>, while a collection of 700 audio tapes produced at a studio that happened to be located in Chapel Hill might have <geogname>Chapel Hill, N.C.</geogname> 700 times. The indexing software will treat Chapel Hill as more important in the latter than in the former.
Therefore, in an effort to streamline processing time (every little bit counts) and to level the potential indexing playing field a bit, we're going to tag a bit less. In general:
- never use content tags in the <abstract> (they're not legal there) or <bioghist> (they can yield false hits there);
- always tag <controlaccess> terms (you may choose to leave this tagging to the cataloger);
- always tag the collection-level <scopecontent> (this tagging can also be left to the cataloger, but you should write the <scopecontent> in a content-tag-aware way, as explained elsewhere in this manual; see Abstracts and Collection Overviews);
- tag the collection's creator only in the collection-level <scopecontent> (Collection Overview) and there only once;
- in other <scopecontent>s or in folder/item lists, tag the first occurrence of a name, subject, or term per series/subseries/list (e.g., the first time that name, subject, or term appears in the Series 1 <scopecontent> or the first time that name, subject, or term appears in the list of audiotapes in Series 3).
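The rules above might look something like the following in practice. This is only a sketch: the collection, people, and series are invented, and the fragment is trimmed to the elements that matter here rather than being a complete, valid finding aid.

```xml
<!-- Hypothetical finding aid fragment; all names are invented for illustration -->
<archdesc level="collection">
  <did>
    <unittitle>Jane Q. Example Papers</unittitle>
    <!-- No content tags in the abstract: they are not legal there -->
    <abstract>Jane Q. Example taught school in Chapel Hill, N.C.</abstract>
  </did>
  <!-- No content tags in bioghist: they can yield false hits -->
  <bioghist>
    <p>Jane Q. Example taught school in Chapel Hill, N.C.</p>
  </bioghist>
  <!-- Collection-level scopecontent: the creator is tagged here, and only once -->
  <scopecontent>
    <p>Letters and diaries of <persname>Jane Q. Example</persname>, chiefly about
      teaching in <geogname>Chapel Hill, N.C.</geogname> Example's diaries also
      describe her travels.</p>
  </scopecontent>
  <!-- controlaccess terms are always tagged -->
  <controlaccess>
    <persname>Example, Jane Q.</persname>
    <geogname>Chapel Hill (N.C.)</geogname>
  </controlaccess>
  <dsc type="combined">
    <c01 level="series">
      <did>
        <unittitle>Series 1. Correspondence</unittitle>
      </did>
      <!-- Series-level scopecontent: tag only the first occurrence of each term -->
      <scopecontent>
        <p>Letters from <persname>John Sample</persname> about schools in
          <geogname>Chapel Hill, N.C.</geogname> Later letters from John Sample
          discuss his move away from Chapel Hill.</p>
      </scopecontent>
    </c01>
  </dsc>
</archdesc>
```

Note that the creator's name is left untagged in the Series 1 <scopecontent> and that the second mentions of John Sample and Chapel Hill in that series carry no tags.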
Here's what the NCEAD Best Practices Guidelines 2002 say about encoding granularity and content tagging (note that we will not be using the normal attribute tagging discussed in the last bullet, the third of the content tagging guidelines):
- The "granularity of encoding" of a finding aid refers to the amount of effort expended in the application of subject terms, linking, and other elements which, while not necessary for a complete and valid EAD document, may be applied to enhance searchability.
- Thorough tagging of content within container lists is an important but time-consuming and expensive endeavor. The benefits of tagging to this level of granularity are unclear. NCEAD recommends tagging each applicable content term once that does not occur in the high-level <controlaccess> section. Each term tagged should be considered integral to understanding the breadth and context of the collection. At a minimum level, content tagging MUST be used in the high-level <controlaccess>…
- In addition to the use of these content tags in the <controlaccess> section of the EAD instance, NCEAD recommends that high-level <scopecontent>s include detailed content tagging. If an institution includes content tagging in these high-level elements, such tagging should be done consistently for all encoded finding aids from the institution.
- The goal of content tagging is to assist future searching and indexing of finding aids. It is foreseeable that a search protocol would rank a finding aid higher or more relevant if a content term were tagged on multiple occasions within a finding aid. Therefore we must be careful to not overstate the significance of any particular finding aid by using content tags throughout every section of the finding aid. If content tagging is employed consistently for all finding aids from an institution or institutions, fewer false leads will be encountered by researchers.
- Guidelines for content tagging include:
  - Tag only those terms that have significance for the collection.
  - Tag the first instance of a particular term only once within each “section” (i.e., collection level or series level <scopecontent>).
  - normal attributes can be used to offer normalized forms of names or terms, but remember this can be very time-consuming and is not easily automated. In addition, if the normal attribute is used, a source attribute should be employed to indicate the controlled vocabulary used.
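For reference only, since we will not be using it, here is a sketch of what the normal and source attributes described in that last guideline could look like. The name and its normalized form are invented and do not correspond to a real authority record; lcnaf is used on the assumption that the normalized form would come from the Library of Congress Name Authority File.

```xml
<!-- Hypothetical example; the normalized heading is invented, not a real authority record -->
<scopecontent>
  <p>Letters from <persname normal="Example, Jane Q., 1900-1980"
      source="lcnaf">Jane Example</persname> to her family.</p>
</scopecontent>
```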