
User Guide Working Draft

Title: User Guide Working Draft
Date Issued: 1998-02-19
Status of Document: This is a DCMI Working Draft.
Description of Document: A user guide for simple Dublin Core.

A USER GUIDE FOR SIMPLE DUBLIN CORE

DRAFT VERSION 3.1

1. INTRODUCTION

1.1. What is metadata?

Metadata is, simply stated, a description of an information resource. The prefix "meta" derives from a Greek word meaning "with," "after," or "beyond," and in English compounds it often signals change; "metadata," then, is data that documents the origins of, and/or tracks the change or use of, data. Metadata may be used for a variety of purposes: to identify a resource to meet a particular information need; to evaluate the quality or fitness for use of such a resource; to track the characteristics of a resource for subsequent maintenance or usage over time; and so on. The various metadata standards in use today by different communities of users are designed for some or all of these purposes.

A metadata record consists of a set of attributes, or elements, necessary to describe the resource in question. For example, a metadata system common in libraries -- the library catalog -- contains a set of metadata records with elements that describe a book or other library item: author, title, date of creation or publication, subject coverage, and the call number specifying location of the item on the shelf.

The linkage between a metadata record and the resource it describes may take one of two forms:

  1. elements may be contained in a record separate from the item, as in the case of the library's catalog record; or
  2. the metadata may be embedded in the resource itself.

Examples of embedded metadata that is "carried along with the resource itself" include the Cataloging In Publication (CIP) data printed on the verso of a book's title page and the TEI header in an electronic text. Many metadata standards in use today, including the Dublin Core standard, do not prescribe either type of linkage, leaving the decision to each particular implementation.

Although the concept of metadata predates the Internet and the Web, worldwide interest in metadata standards and practices has exploded with the increase in electronic publishing and digital libraries, and the concomitant "information overload" resulting from vast quantities of undifferentiated digital data available online. Anyone who has attempted to find information online using one of today's popular Web search services has likely experienced the frustration of retrieving hundreds, if not thousands, of "hits" with little ability to refine the search or make it more precise. The wide-scale adoption of descriptive standards and practices for electronic resources will improve retrieval of relevant resources from the "Internet commons." As noted by Weibel and Lagoze, two leaders in the field of metadata development:

"The association of standardized descriptive metadata with networked objects has the potential for substantially improving resource discovery capabilities by enabling field-based (e.g., author, title) searches, permitting indexing of non-textual objects, and allowing access to the surrogate content that is distinct from access to the content of the resource itself." (Weibel and Lagoze, 1997)

It is this need for "standardized descriptive metadata" that the Dublin Core addresses.

1.2. What is the Dublin Core?

The Dublin Core metadata standard is a simple yet effective element set for describing a wide range of networked resources. The Dublin Core standard comprises fifteen elements, the semantics of which have been established through consensus by an international, cross-disciplinary group of professionals from librarianship, computer science, text encoding, the museum community, and other related fields of scholarship.

The Dublin Core element set is outlined in Section 4. Each element is optional and may be repeated. Each element also has a limited set of qualifiers, attributes that may be used to further refine (not extend) the meaning of the element. The present guide explores "simple" Dublin Core; more information on qualifiers and their use will be covered in A User Guide for Qualified Dublin Core [not yet available].

Although the Dublin Core favors document-like objects (because traditional text resources are fairly well-understood), it can apply to other resources as well. Its suitability for use with particular non-document resources will depend to some extent on how closely their metadata resembles typical document metadata and also what purpose it is intended to serve.

Dublin Core has as its goals the following characteristics:

Simplicity of creation and maintenance
The Dublin Core element set has been kept as small and simple as possible to allow a non-specialist to create simple descriptive records for information resources easily and inexpensively, while providing for effective retrieval of those resources in the networked environment.
Commonly understood semantics
Discovery of information across the vast commons of the Internet is hindered by differences in terminology and descriptive practices from one field of knowledge to the next. The Dublin Core can help the 'digital tourist' -- a non-specialist searcher -- find his or her way by supporting a common set of elements, the semantics of which are universally understood and supported. For example, scientists concerned with locating articles by a particular author, and art scholars interested in works by a particular artist, can agree on the importance of a "creator" element. Such convergence on a common, if slightly more generic, element set increases the visibility and accessibility of all resources, both within a given discipline and beyond.
International scope
Although the specific linguistic challenges of the World Wide Web have not been directly addressed by the Dublin Core development community, the involvement of representatives from almost every continent has ensured that the development of the standard considers the multilingual and multicultural nature of the electronic information universe.
Extensibility
While balancing the needs for simplicity in describing digital resources with the need for precise retrieval, Dublin Core developers have recognized the importance of providing a mechanism for extending the DC element set for additional resource discovery needs. It is expected that this need for extensibility will be met by additional metadata packages, created and administered by another community of metadata experts, that can be linked to a Dublin Core metadata record. [Do we want to say something about RDF at this point?]

1.3. The purpose and scope of this guide

This document is intended to help non-specialists create simple descriptive records for information resources (for example, electronic documents). Creators of these records include authors, editors, and World-Wide Web (WWW) site administrators.

The guide will show in a non-technical fashion how Dublin Core metadata may be used by anyone to make their material more accessible. This guide discusses the layout and content of Dublin Core Metadata elements, and how to use them in composing a complete Dublin Core Metadata record.

Another important goal of this document is to promote "best practices" for describing resources using the Dublin Core element set. The Dublin Core community recognizes that consistency in creating metadata is an important key to achieving complete retrieval and intelligible display across disparate sources of descriptive records. Inconsistent metadata effectively hides desired records, resulting in uneven, unpredictable or incomplete search results.


2. Why HTML?

In this guide, we have chosen to represent Dublin Core records in HTML, the Web's Hypertext Markup Language format, because this is the area in which the underlying concepts may most easily be demonstrated at the present time. It is important to note, however, that Dublin Core concepts are equally applicable to virtually any file format, as long as the metadata is in a form suitable for interpretation both by the search engines and by human beings.

HTML has two tags that can be used to capture metadata. These are the <META> and <LINK> tags. If creating metadata that will be embedded in, or appear alongside, an actual document, these tags must appear within the HEAD part of the HTML document. For example:

<HTML>
<HEAD>
<TITLE>Mating Habits of the Northern Hairy Nosed Wombat</TITLE>
<META NAME="DC.creator" CONTENT="Smythe, Pearl">
<LINK REL="http://www.tu.edu.au/~psmythe/">
</HEAD>
<BODY>
<H1>Northern Hairy Nosed Wombats</H1>
<P> The Northern Hairy Nosed Wombat is an animal native to Australia....</P>
</BODY>
</HTML>

Indexing programs understand that the metadata record starts after the "<HEAD>" line and ends before the "</HEAD>" line, and are thus able to extract metadata automatically. The metadata does not appear during normal document formatting or printing, and metadata-aware Web browsers may even be able to exploit it. A number of the current search engines have begun to include the ability to make use of the HTML META tag in Web documents.

In HTML, each record element definition begins with "<META". Within the META tag, two attribute/value pairs (as found in other HTML tags) are used to define the metadata. The first is generally NAME, the second, CONTENT. These two work together to define the metadata within the META tag.
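As a rough illustration of how an indexing program might collect these NAME/CONTENT pairs from the HEAD part of a document, the following Python sketch uses the standard library's html.parser module. The names DublinCoreExtractor and extract_dc are hypothetical, and the choice to keep only names beginning with "DC." is an assumption made for this example; real indexing software may work quite differently.

```python
from html.parser import HTMLParser


class DublinCoreExtractor(HTMLParser):
    """Collect (name, content) pairs from META tags found inside HEAD."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.records = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag == "head":
            self.in_head = True
        elif tag == "meta" and self.in_head:
            attributes = dict(attrs)
            name = attributes.get("name") or ""
            # Assumption: only names carrying the "DC." prefix are of interest.
            if name.lower().startswith("dc."):
                self.records.append((name, attributes.get("content") or ""))

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def extract_dc(html_text):
    """Return the Dublin Core pairs embedded in the HEAD of html_text."""
    parser = DublinCoreExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.records
```

META tags that appear outside the HEAD part are ignored by this sketch, in line with the rule that embedded metadata must sit within HEAD.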

This document will not cover the use of the LINK tags.

Below are some examples of how the META tag might be used in stand-alone and embedded metadata. Note that each metadata definition happens to fit on one line, but in general a definition can span several lines.

2.1. Stand-Alone Metadata

Stand-alone metadata can exist in any kind of database. This example describes a photograph in another file that has a location given by a Uniform Resource Locator (URL). The entire record file looks like this:

<META name="DC.title" content="Kita Yama (Japan)">
<META name="DC.creator" content="Kertesz, Andre">
<META name="DC.date" content="1968">
<META name="DC.type" content="e/photograph">
<META name="DC.format" content="GIF">
<META name="DC.identifier" content="http://foo.bar.zaf/kertesz/kyama">
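A stand-alone record of this kind could be generated from name/content pairs held in a database. The Python sketch below is one possible way to do so; the function name render_meta is hypothetical, and escaping the values with html.escape is an assumption added here so that a value containing a quotation mark does not break the CONTENT attribute.

```python
from html import escape


def render_meta(pairs):
    """Render (name, content) pairs as one META tag per line."""
    lines = []
    for name, content in pairs:
        # quote=True also escapes quotation marks inside attribute values.
        lines.append('<META name="%s" content="%s">'
                     % (escape(name, quote=True), escape(content, quote=True)))
    return "\n".join(lines)


if __name__ == "__main__":
    record = [
        ("DC.title", "Kita Yama (Japan)"),
        ("DC.creator", "Kertesz, Andre"),
        ("DC.date", "1968"),
    ]
    print(render_meta(record))
```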

2.2. Metadata Contained in a Resource

The next example is of a metadata record contained in a file alongside the document that it describes. The document is a short poem expressed in HTML, the Web's Hypertext Markup Language [3].

<HTML>
<HEAD>
<TITLE>Song of the Open Road</TITLE>
<META name="DC.title" content="Song of the Open Road">
<META name="DC.creator" content="Nash, Ogden">
<META name="DC.type" content="e/document">
<META name="DC.date" content="1939">
<META name="DC.format" content="HTML">
<META name="DC.identifier" content="http://www.poetry.com/nash/open.html">
</HEAD>
<BODY><PRE>
I think that I shall never see
A billboard lovely as a tree.
Indeed, unless the billboards fall
I'll never see a tree at all.
</PRE></BODY>
</HTML>

3. Basic Principles of Descriptive Elements

The notation (one of several) described in this guide is based on the HTML META tag. The character set assumed is standard UNICODE with the UTF-8 encoding. This allows for a very wide range of writing systems while remaining compatible with the traditional ASCII character set.

3.1. Element Parts and Syntax

As demonstrated in Section 2, each descriptive element definition has a NAME part and a CONTENT part, as in:

<META name="DC.creator" content="Browning, Elizabeth">

Any metadata element may be omitted or repeated. When repeating elements, it is recommended best practice to list each element definition separately, as in:

<META name="DC.creator" content="Marx, Karl">
<META name="DC.creator" content="Engels, Friedrich">

However, it is also valid to express repeated elements by sharing the name part between elements that are the same, and, using a semi-colon as a delimiter, clustering the content values of these elements under a single CONTENT part, as in:

<META name="DC.creator" content="Marx, Karl ; Engels, Friedrich">
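Software reading such records may want to treat both forms alike. A minimal Python sketch, assuming that a semi-colon never occurs inside a single value, expands the clustered form into one pair per value; the function name expand_repeated is hypothetical.

```python
def expand_repeated(name, content):
    """Split a semi-colon delimited CONTENT value into separate pairs."""
    pairs = []
    for part in content.split(";"):
        value = part.strip()
        if value:  # skip empty pieces left by stray delimiters
            pairs.append((name, value))
    return pairs


if __name__ == "__main__":
    for pair in expand_repeated("DC.creator", "Marx, Karl ; Engels, Friedrich"):
        print(pair)
```

Listing each element definition separately, as recommended above, avoids the need for this step and for the assumption behind it.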

The document "A Proposed Convention for Embedding Metadata in HTML" sets out an agreed convention for identifying and grouping metadata schemes in HTML. This convention relies on the use of a prefix to indicate that the elements used are from Dublin Core or another metadata scheme. For increased readability the prefix "DC" should be written in upper case letters and element names should be written in lower case letters. For example:

META NAME = "DC.title"
META NAME = "DC.creator"

NOT

DC.CREATOR or dc.CREATOR or DC.Creator
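Existing records that depart from this case convention can be brought into line mechanically. The following Python sketch shows one way this might be done; the function name normalize_name is hypothetical, and names from other metadata schemes are deliberately left untouched because this guide states no case convention for them.

```python
def normalize_name(name):
    """Return a Dublin Core NAME with an upper-case prefix and lower-case element."""
    prefix, separator, element = name.partition(".")
    if separator and prefix.upper() == "DC":
        return "DC." + element.lower()
    # Not a Dublin Core name: return it unchanged.
    return name


if __name__ == "__main__":
    for raw in ("DC.CREATOR", "dc.CREATOR", "DC.Creator"):
        print(raw, "->", normalize_name(raw))
```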

3.2. Element Content and Controlled Vocabularies

Content data for some elements may be restricted to a "controlled vocabulary," which is a limited set of consistently used and carefully defined terms. This can dramatically improve search results because computers are good at matching words character by character but weak at understanding the way people refer to one concept using different words (having very different characters). Without basic terminology control, inconsistent or incorrect metadata can profoundly degrade the quality of search results.

One cost of a controlled vocabulary is in needing an administrative body to review, update, and disseminate the vocabulary. For example, the US Library of Congress Subject Headings (LCSH) and the US National Library of Medicine Medical Subject Headings (MeSH) are formal vocabularies indispensable for searching rigorously cataloged collections, and both require significant support organizations. Another cost is in having to train searchers and creators of metadata so that they know when using MeSH, for example, to enter "myocardial infarction" instead of the more colloquial "heart attack."
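The effect of terminology control can be pictured as a lookup from the words people actually type to the preferred term. The Python sketch below uses a one-entry table built from the example above; the table PREFERRED_TERMS and the function to_preferred_term are hypothetical and are not part of MeSH or of any real vocabulary service.

```python
# Hypothetical mini-vocabulary: colloquial term -> preferred term.
PREFERRED_TERMS = {
    "heart attack": "myocardial infarction",
}


def to_preferred_term(term):
    """Map a term to its preferred form, or return it unchanged if unknown."""
    # Lower-case and collapse whitespace so trivial variations still match.
    key = " ".join(term.lower().split())
    return PREFERRED_TERMS.get(key, term)


if __name__ == "__main__":
    print(to_preferred_term("Heart attack"))
```

A real vocabulary holds many thousands of such mappings, which is why it needs the administrative support described above.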


4. The Core Elements

This section lists each Core element by its full name and label. For each element there is a reference description (taken from the RFC) and there are guidelines to assist you in creating metadata CONTENT, whether you do it "from scratch" or by converting an existing record in another format.

The elements are listed in the order they were developed, but there are other useful ways to group them. In the following table, you can see that some elements relate to the content of the item, some to the item as intellectual property, still others to the particular instantiation, or version, of the item.

Content        Intellectual Property    Instantiation
-----------    ---------------------    -------------
Coverage       Contributor              Date
Description    Creator                  Format
Type           Publisher                Identifier
Relation       Rights                   Language
Source
Subject
Title

In the element descriptions below, a formal single-word label is specified to make the syntactic specification of elements simpler for encoding schemes. Although some environments, such as HTML, are not case-sensitive, it is recommended best practice always to adhere to the case conventions in the element names given below to avoid conflicts in the event that the metadata is subsequently converted to a case-sensitive environment, such as XML/RDF.

Some information may appear to belong in more than one metadata element. While there will normally be a clear preferred choice, there is potential semantic overlap between some elements, so there will occasionally be some judgment required from the metadata provider.

Many examples in this document are coded in the recommended HTML form; others are in a generic form:

ELEMENT = "value"

4.1. Title

Label: TITLE

Element Description: The name given to the resource by the CREATOR or PUBLISHER.

Guidelines for creation of content:

Drop initial articles in titles (a, an, the) if the title is in English, or in other languages where articles can be easily identified. If in doubt about what constitutes the title, repeat the TITLE element and include the variants in second and subsequent TITLE iterations. If the item is in HTML, view the source document and make sure that the title identified in the title header is also included as a meta title (unless the DC metadata element is to be embedded in the document itself). If an acronym for the CREATOR or PUBLISHER appears before the title, do not include that as part of the title unless it is grammatically linked to the title itself.

Examples:

<META name = "DC.title" content = "Pilot's Guide to Aircraft Insurance">

NOT:

<META name = "DC.title" content = "A Pilot's Guide to Aircraft Insurance">

AND:

<META name = "DC.title" content = "Tips on Buying Used Aircraft">

NOT:

<META name = "DC.title" content = "AOPA's Tips on Buying Used Aircraft">

BUT:

<META name = "DC.title" content = "AOPA urges mayor to reopen Albuquerque airport">
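The first of these guidelines, dropping an initial English article, is simple enough to sketch in code. The Python function below is a hypothetical helper named drop_initial_article; it handles only "a," "an," and "the," and it does not attempt the acronym rule, which calls for human judgment about whether the acronym is grammatically linked to the title.

```python
ENGLISH_ARTICLES = ("a", "an", "the")


def drop_initial_article(title):
    """Remove a leading English article from a title, if one is present."""
    words = title.split(None, 1)  # split off the first word only
    if len(words) == 2 and words[0].lower() in ENGLISH_ARTICLES:
        return words[1]
    return title


if __name__ == "__main__":
    print(drop_initial_article("A Pilot's Guide to Aircraft Insurance"))
```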

4.2. Author or Creator

Label: CREATOR

Element Description: The person(s) or organization(s) primarily responsible for the intellectual content of the resource. For example, authors in the case of written documents, artists, photographers, or illustrators in the case of visual resources.

Guidelines for creation of content:

CREATORs should be listed separately in the same order that they appear in the publication. Personal names should be listed surname or family name first, followed by forename or given name. When in doubt, give the name as it appears, and do not invert.

Examples:

<META name = "DC.creator" content = "Duncan, Phyllis-Anne">

CREATOR = "Melendez Santiago, Maria Luz"

CREATOR = "Maimonides"

BUT:

<META name = "DC.creator" content = "Park Sung Hee">

In the case of organizations where there is clearly a hierarchy present, list the parts of the hierarchy from largest to smallest, and separated by full stops.

Examples:

CREATOR = "United States. Internal Revenue Service"

CREATOR = "Elvis Presley Fan Club"

<META name = "DC.creator" content = "Federal Aviation Administration. Aviation Safety Program.">

NOT:

<META name = "DC.creator" content = "Aviation Safety Program of the Federal Aviation Administration">

If it is not clear whether there is a hierarchy present, or unclear which is the larger or smaller portion of the body, give the name as it appears in the item.

<META name = "DC.creator" content = "Art Institute of Chicago">

CREATOR = "Association of the Bar of the City of New York"

CREATOR = "Baltimore County Medical Society"

If the CREATOR and PUBLISHER are the same, do not repeat the name in the PUBLISHER area. If the nature of the responsibility is ambiguous, the recommended practice is to use PUBLISHER for organizations, and CREATOR for individuals. In cases of lesser responsibility, other than creation, use CONTRIBUTOR.

4.3. Subject and Keywords

Label: SUBJECT

Element Description: The topic of the resource, or keywords or phrases that describe the subject or content of the resource. The intent of the specification of this element is to promote the use of controlled vocabularies and keywords. This element might well include classification data (for example, Library of Congress Classification Numbers or Dewey Decimal numbers) or controlled vocabularies (such as Medical Subject Headings or Art and Architecture Thesaurus descriptors) as well.

Guidelines for creation of content:

Select subject keywords from either the TITLE or DESCRIPTION information. Avoid using words not present in either area, and avoid listing equivalent terms: if another term is preferred to the one used in the description, use the preferred term, but do not give both a preferred and a non-preferred term that mean the same thing. If the subject of the item is a person or an organization, use the same form of the name as you would if the person or organization were a CREATOR, but do not repeat the name in the CREATOR element.

In general, choose the most significant and unique words for keywords, avoiding those too general to describe a particular item.

Examples:

<META name = "DC.subject" content = "Aircraft leasing and renting">

SUBJECT = "Dogs"

SUBJECT = "Olympic skiing"
SUBJECT = "Street, Picabo"

4.4. Description

Label: DESCRIPTION

Element Description: A textual description of the content of the resource, including abstracts in the case of document-like objects or content descriptions in the case of visual resources. Since the description field is a potentially rich source of indexable vocabulary, care should be taken to provide this element when possible.

Some metadata collections could include content description (spectral analysis of a visual resource, for example) that may not be embeddable in current network systems. In such a case this field might contain a link to such a description rather than the description itself.

Guidelines for creation of content:

Descriptive information can be taken from the item itself, if there is no abstract or other structured description available. Normally, if a DESCRIPTION cannot be found either in the introductory or front matter, or the first few paragraphs, it should be set up "on the fly" by the metadata provider. Normally, DESCRIPTION should be limited to a few brief sentences.

Example:

<META name = "DC.description" content = "Illustrated guide to airport markings and lighting signals, with particular reference to SMGCS (Surface Movement Guidance and Control System) for airports with low visibility conditions.">

4.5. Publisher

Label: PUBLISHER

Element Description: The entity responsible for making the resource available in its present form, such as a publisher, a university department, or a corporate entity. The intent of specifying this field is to identify the entity that provides access to the resource.

Guidelines for content creation:

If the CREATOR and PUBLISHER are the same, do not repeat the name in the PUBLISHER area. If the nature of the responsibility is ambiguous, the recommended practice is to use PUBLISHER for organizations, and CREATOR for individuals. In cases of lesser responsibility, other than creation, use CONTRIBUTOR.

Examples:

<META name = "DC.publisher" content = "Moguls Anonymous">

PUBLISHER = "University of Miami. Dept. of Economics"

PUBLISHER = "Microsoft Corporation"

4.6. Other Contributor

Label: CONTRIBUTOR

Element Description: Person(s) or organization(s) in addition to those specified in the CREATOR element who have made significant intellectual contributions to the resource but whose contribution is secondary to the individuals or entities specified in the CREATOR element (for example, editors, transcribers, illustrators, and convenors). (See CREATOR.)

Guideline for content creation:

The same general guidelines for using names of persons or organizations as CREATORs apply here.

4.7. Date

Label: DATE

Element Description: The date the resource was made available in its present form. Recommended best practice is an 8-digit number in the form YYYY-MM-DD as defined in http://www.w3.org/TR/NOTE-datetime, a profile of ISO 8601. In this scheme, the date element 1994-11-05 corresponds to November 5, 1994. Many other schemes are possible, but if used, they may not be easily interpreted by users.

Guidelines for content creation:

If the full date is unknown, month and year (YYYY-MM) or just year (YYYY) may be used.

<META name = "DC.date" content = "1998-02-16">
<META name = "DC.date" content = "1998-02">
<META name = "DC.date" content = "1998">

DATE = "19980216" [ANSI X3.30-1985]
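A metadata creation tool could check date values against the recommended forms before accepting them. The Python sketch below accepts only YYYY-MM-DD, YYYY-MM, and YYYY; the function name is_recommended_date is hypothetical. Values in other schemes, such as the ANSI form shown above, are reported as not matching the recommendation, which does not mean they are invalid Dublin Core.

```python
import re
from datetime import date

# YYYY, optionally followed by -MM, optionally followed by -DD.
_RECOMMENDED = re.compile(r"(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?")


def is_recommended_date(value):
    """Return True if value is YYYY, YYYY-MM, or YYYY-MM-DD and names a real date."""
    match = _RECOMMENDED.fullmatch(value)
    if match is None:
        return False
    year, month, day = match.groups()
    try:
        # Missing month or day defaults to 1 purely for the calendar check.
        date(int(year), int(month or 1), int(day or 1))
    except ValueError:
        return False
    return True


if __name__ == "__main__":
    for candidate in ("1998-02-16", "1998-02", "1998", "19980216", "1998-02-30"):
        print(candidate, is_recommended_date(candidate))
```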

4.8. Resource Type

Label: TYPE

Element Description: The category of the resource, such as home page, novel, poem, working paper, technical report, essay, dictionary. For the sake of interoperability, type should be selected from an enumerated list that is under development in the workshop series at the time of publication of this document. See http://sunsite.Berkeley.EDU/Metadata/types.html for current thinking on the application of this element.

Guidelines for content creation:

Use a type from the list above.

Examples:

<META name = "DC.type" content = "image">

TYPE = "sound"

TYPE = "text.monograph"

4.9. Format

Label: FORMAT

Element Description: The data representation of the resource, such as text/html, ASCII, PostScript file, executable application, or JPEG image. The intent of specifying this element is to provide information necessary to allow people or machines to make decisions about the usability of the encoded data (what hardware and software might be required to display or execute it, for example). As with RESOURCE TYPE, FORMAT will be assigned from enumerated lists such as registered Internet Media Types (MIME types). In principle, formats can include physical media such as books, serials, or other non-electronic media.

Guidelines for content creation:

[Link to list of MIME Types?]

4.10. Resource Identifier

Label: IDENTIFIER

Element Description: String or number used to uniquely identify the resource. Examples for networked resources include URLs and URNs (when implemented). Other globally-unique identifiers, such as International Standard Book Numbers (ISBN) or other formal names would also be candidates for this element.

Guidelines for content creation:

This element can also be used for local identifiers (e.g. ID numbers or call numbers) assigned by the CREATOR of the resource to apply to a particular item.

Examples:

<META name = "DC.identifier" content = "http://purl.oclc.org/metadata/dublin_core/">

IDENTIFIER = "0385424728" [ISBN]

IDENTIFIER = "H-A-X 5690B" [publisher number]

4.11. Source

Label: SOURCE

Element Description: Information about a second resource from which the present resource is derived. This element may contain a date, format, identifier, or other information pertaining to the second resource which is considered to be useful for discovery of the present resource. This element is not applicable for a resource that appears in its original form.

Metadata about the second resource might be available separately, and made available by following links indicated by RELATION elements. That would be the preferred model. However, in practice it might be found convenient to embed metadata relating to a source resource in the metadata of the present resource. The SOURCE element provides a home for this information, while maintaining the distinction between metadata relating strictly to the present resource and that relating strictly to the ancestor.

Guidelines for content creation:

In general, include in this area information which does not fit easily into RELATION.

Example:

<META name = "DC.source" content = "RC607.A26W574 1996">
[where "RC607.A26W574 1996" is the call number of the print version of the resource, from which the present version was scanned]

4.12. Language

Label: LANGUAGE

Element Description: Language(s) of the intellectual content of the resource. Where practical, the content of this field should coincide with RFC 1766. See: http://ds.internic.net/rfc/rfc1766.txt

Guidelines for content creation:

Coded or textual information can be represented here. If the content is in more than one language, the element may be repeated.

Examples:

LANGUAGE = "en"
LANGUAGE = "fr"

<META name = "DC.language" content = "Primarily English, with some abstracts also in French.">

4.13. Relation

Label: RELATION

Element Description: The relationship of this resource to other resources. The intent of this element is to provide a means to express relationships among resources that have formal relationships to others, but exist as discrete resources themselves. For example, images in a document, chapters in a book, or items in a collection.

Relationships between resources fall into a number of classes, such as parent-child, aggregations, etc. A list of types which accommodates most expected relationships is:

Guidelines for content creation:

The recommended method for expressing a relationship in unqualified DC is:

IDENTIFIER = "unique identifier for the present resource"
RELATION = "relationship-type; unique identifier for the related resource"

where "relationship-type" is a token drawn from the list above.

Note: In the case where the DC metadata is embedded in the present resource, the value for IDENTIFIER is implied (i.e. the present resource). In qualified DC the two components given in RELATION here will be structured using sub-elements for easier automated processing.

Examples:

IDENTIFIER = "Mt. Kosciusko"
RELATION = "IsPartOf; Snowy Mountains in Australia"

[Part/Whole relations are those in which one resource is a physical or logical part of another.]

IDENTIFIER = "Elton John's 1997 song Candle in the Wind"
RELATION = "IsVersionOf; Elton John's 1976 song Candle in the Wind"

IDENTIFIER = "Gombrich's Story of Art"
RELATION = "HasVersion; 13th Edition, 1972"

[Version relations are those in which one resource is an historical state or edition of another resource by the same creator.]

IDENTIFIER = "paper.html"
RELATION = "IsFormatOf; paper.doc"

IDENTIFIER = "Landsat TM dataset of Arnhemland, NT, Australia"
RELATION = "HasFormat; arnhem.gif"

[Format transformation relations are those in which one resource has been derived from another by a reproduction or reformatting technology which is not fundamentally an interpretation but intended to be a representation.]

IDENTIFIER = "Morgan's Ancient Society"
RELATION = "IsReferencedBy; Engels' Origin of the Family, Private Property and the State"

IDENTIFIER = "Movie Review A"
RELATION = "References; Movie G"

[Reference relations are those in which the author of one resource cites, acknowledges, disputes or otherwise makes claims about another resource.]

IDENTIFIER = "Peter Carey's novel Oscar and Lucinda"
RELATION = "IsBasisFor; 1998 movie Oscar and Lucinda"

IDENTIFIER = "The movie My Fair Lady"
RELATION = "IsBasedOn; Shaw's play Pygmalion"

[Creative relations are those in which one resource is a performance, production, derivation, adaptation or interpretation of another resource.]

IDENTIFIER = "program.c"
RELATION = "Requires; stdio.h"

IDENTIFIER = "List of Internet Media Types"
RELATION = "IsRequiredBy; Dublin Core FORMAT element"

[Dependency relations are those in which one resource requires another resource for its functioning, delivery, or content and cannot be used without the related resource being present.]
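Because unqualified DC packs the relationship type and the related identifier into a single value, software has to split them apart again. The Python sketch below shows one way this might be done. The function name parse_relation is hypothetical, and the set EXAMPLE_RELATION_TYPES contains only the tokens that appear in the examples in this section, so it should not be read as the complete list of relationship types.

```python
# Tokens taken from the examples in this section only (not a definitive list).
EXAMPLE_RELATION_TYPES = {
    "IsPartOf", "IsVersionOf", "HasVersion", "IsFormatOf", "HasFormat",
    "IsReferencedBy", "References", "IsBasisFor", "IsBasedOn",
    "Requires", "IsRequiredBy",
}


def parse_relation(value):
    """Split 'relationship-type; identifier' into a (type, identifier) tuple."""
    # Split at the first semi-colon only, so the identifier may contain more.
    relation_type, separator, target = value.partition(";")
    relation_type = relation_type.strip()
    target = target.strip()
    if not separator or not target:
        raise ValueError("expected 'relationship-type; identifier': %r" % value)
    if relation_type not in EXAMPLE_RELATION_TYPES:
        raise ValueError("unrecognized relationship type: %r" % relation_type)
    return relation_type, target


if __name__ == "__main__":
    print(parse_relation("IsPartOf; Snowy Mountains in Australia"))
```

The need for this kind of string handling is the reason qualified DC is expected to carry the two components in separate sub-elements.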

4.14. Coverage

Label: COVERAGE

Element Description: The spatial and/or temporal characteristics of the resource. Formal specification of coverage is currently under development. Users and developers should understand that use of this element is currently considered to be experimental.

Guidelines for content creation:

{Will need to either link or incorporate work done to date by Mary L. and Coverage WG}

4.15. Rights Management

Label: RIGHTS

Element Description: A link to a copyright notice, to a rights-management statement, or to a service that would provide information about terms of access to the resource. Formal specification of rights is currently under development. Users and developers should understand that use of this element is currently considered to be experimental.

Guidelines for content creation:

At present, this element is used only for a free-text message indicating access limitations for a particular resource.

<META name = "DC.rights" content = "Available to AOPA members only">

RIGHTS = "Available only to institutions with a subscription."

  1. Glossary

  2. Background reading and References

rev. 19feb98 dih