
User Guide Working Draft

Title: User Guide Working Draft
Creator:
Date Issued: 1998-01-30
Identifier:
Replaces: Not applicable
Is Replaced By:
Latest Version:
Status of Document: This is a DCMI Working Draft
Description of Document: A user guide for simple Dublin Core.

A USER GUIDE FOR SIMPLE DUBLIN CORE

DRAFT VERSION 2

1. INTRODUCTION

1.1. What is metadata?

The concept of metadata predates the Web. The descriptive information about any information resource is called metadata, which simply means data about data. Metadata can describe many aspects of an information resource and help provide access to an information resource. For example, a record in a library catalogue is a set of metadata which is associated with the book (or other information resource) in the library collection. Similarly, metadata can be linked to the information resource on the WWW, or stand alone in a separate database.

The metadata record consists of a handful of elements designed to make it easy to find a resource, for example: the author of the book or the Web resource, the proper title, the date the resource was created or published, a link to the resource itself or related resources and so on.

1.2. What is this thing called Dublin Core?

Dublin Core metadata is a small set of easily created elements that can be applied to most kinds of information resources. This set was conceived by an international interdisciplinary group of professionals from librarianship, computer science, and text encoding in order to improve Internet resource discovery.

Dublin Core favors document-like objects because traditional text resources are fairly well-understood, although it can apply to other resources in proportion to how closely their metadata resembles typical document metadata.

Dublin Core has as its goals simplicity, flexibility, and semantic interoperability. Dublin Core metadata must be in a form suitable for interpretation both by search engines and by human beings.

1.3. The purpose and scope of this guide

This document is intended to help people who have no training in cataloging or indexing to create simple descriptive records for information resources (for example, electronic documents). Creators of these records include authors, editors, and World-Wide Web (WWW) [1] site administrators.

The guide will show in a non-technical fashion how Dublin Core metadata may be used by anyone to make their material more accessible. This guide discusses the layout and content of Dublin Core Metadata elements, and how to use them in composing a complete Dublin Core Metadata record.

An important goal of this document is to encourage metadata consistency as a way to promote complete retrieval and intelligible display across disparate sources of descriptive records. Inconsistent metadata effectively hides desired records.

For further reading: [link to general bibliography section]


2. Why HTML?

Dublin Core concepts are equally applicable to virtually any file format. In the case of this guide, the HTML implementation is stressed because this is the area in which the underlying concepts may most easily be demonstrated at the present time.

2.1. Stand-Alone Metadata

Each record element definition begins with "<META". In the example below each definition happens to fit on one line, but in general a definition can span several lines. Stand-alone metadata can exist in any kind of database.

The first example describes a photograph in another file that has a location given by a Uniform Resource Locator (URL) [2]. The entire record file looks like this:

<META name="DC.title" content="Kita Yama (Japan)">
<META name="DC.creator" content="Kertesz, Andre">
<META name="DC.date" content="1968">
<META name="DC.type" content="e/photograph">
<META name="DC.format" content="GIF">
<META name="DC.identifier" content="http://foo.bar.zaf/kertesz/kyama">

2.2. Metadata Contained in a Resource

The next example of a metadata record is contained in the same file as the document that it describes. The document is a short poem expressed in HTML, the Web's Hypertext Markup Language [3].

<HTML>
<HEAD>
<TITLE>Song of the Open Road</TITLE>
<META name="DC.title" content="Song of the Open Road">
<META name="DC.creator" content="Nash, Ogden">
<META name="DC.type" content="e/document">
<META name="DC.date" content="1939">
<META name="DC.format" content="HTML">
<META name="DC.identifier" content="http://www.poetry.com/nash/open.html">
</HEAD>
<BODY><PRE>
I think that I shall never see
A billboard lovely as a tree.
Indeed, unless the billboards fall
I'll never see a tree at all.
</PRE></BODY>
</HTML>

Indexing programs understand that the metadata record starts after the "<TITLE" line and ends before the "</HEAD" line, and are thus able to extract metadata automatically. Meanwhile, the metadata in the same file does not appear during normal document formatting or printing, and metadata-aware Web browsers may even be able to exploit it.

A number of current search engines have begun to make use of the HTML META tag in Web documents. Alta Vista, for example, uses DESCRIPTION and KEYWORDS qualifiers to the META tag in order to index a given page. The DESCRIPTION is returned in response to a search, rather than the default (but usually far less useful) first couple of lines of text.
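
As a rough sketch of the search-engine usage just described, a page might carry general-purpose META tags like the following. The wording of the description and the keyword list are illustrative assumptions, not taken from any real page, and each search engine decides for itself how it uses these values.

```html
<!-- Hypothetical values: summary text shown in search results -->
<META name="description" content="Short poem by Ogden Nash about billboards and trees.">
<!-- Hypothetical values: comma-separated terms offered to the indexer -->
<META name="keywords" content="poetry, billboards, trees">
```

These tags can sit in the document head next to the Dublin Core elements; they do not replace them.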


3. Basic Principles of Descriptive Elements

The notation (one of several) described in this guide is based on the HTML META tag. The character set assumed is standard UNICODE with the UTF-8 encoding. This allows for a very wide range of writing systems while remaining compatible with the traditional ASCII character set.

3.1. Element Parts and Syntax

Each descriptive element definition has a Name part and a Content part, as in

<META name="DC.creator" content="Browning, Elizabeth">

Any metadata element may be omitted or repeated.
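
To illustrate omission and repetition, here is a minimal sketch of a record that repeats one element and leaves out several others. The title and subject values are invented for illustration only.

```html
<!-- Hypothetical record: SUBJECT is repeated, while DATE, FORMAT and others are simply omitted -->
<META name="DC.title" content="Sonnets from the Portuguese">
<META name="DC.subject" content="sonnets">
<META name="DC.subject" content="love poetry">
```

A repeated element is written as a separate definition each time, rather than by packing several values into one Content part.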

3.2. Element Content and Controlled Vocabularies

Content data for some elements may be restricted to a "controlled vocabulary," which is a limited set of consistently used and carefully defined terms. This can dramatically improve search results because computers are good at matching words character by character but weak at understanding the way people refer to one concept using different words (having very different characters). Without basic terminology control, inconsistent or incorrect metadata can profoundly degrade the quality of search results.

One cost of a controlled vocabulary is in needing an administrative body to review, update, and disseminate the vocabulary. For example, the US Library of Congress Subject Headings (LCSH) [4] and the US National Library of Medicine Medical Subject Headings (MeSH) [5] are formal vocabularies indispensable for searching rigorously cataloged collections, and both require significant support organizations. Another cost is in having to train searchers and creators of metadata so that they know when using MeSH, for example, to enter "myocardial infarction" instead of the more colloquial "heart attack."
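
A controlled-vocabulary term might be recorded along the following lines. This is only a sketch: it borrows the "(scheme=...)" convention that this guide shows for LCSH under the SUBJECT element, and the token "MeSH" as a scheme name is an assumption made here by analogy.

```html
<!-- Preferred MeSH term; the scheme token "MeSH" is assumed by analogy with scheme=LCSH -->
<meta name = "DC.subject" content = "(scheme=MeSH) myocardial infarction">
```

A searcher who enters the same preferred term then matches this record exactly, whereas a record that used only "heart attack" would be missed.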


4. The Core Elements

This section lists each Core element by its full name and label. For each element there is a reference description (taken from the RFC) and there are guidelines to assist you in creating metadata Content, whether you do it "from scratch" or by converting an existing record in another format.

In the element descriptions below, a formal single-word label is specified to make the syntactic specification of elements simpler for encoding schemes. Although some environments, such as HTML, are not case-sensitive, it is recommended best practice always to adhere to the case conventions in the element names given below to avoid conflicts in the event that the metadata is subsequently converted to a case-sensitive environment, such as XML/RDF.

Examples in this document are coded in the recommended HTML form.

4.1. Title

Label: TITLE

Element Description: The name given to the resource by the CREATOR or PUBLISHER.

Guidelines for creation of content:

Drop initial articles in titles (a, an, the) if the title is in English, and in other languages where articles can be easily identified. If in doubt about what constitutes the title, repeat the TITLE element and include the variants in second and subsequent TITLE iterations. If the item is in HTML, view the source document and make sure that the title identified in the title header is also included as a meta title (unless the DC metadata element is to be embedded in the document itself). If an acronym for the CREATOR or PUBLISHER appears before the title, do not include that as part of the title unless it is grammatically linked to the title itself.

Examples:

<meta name = "DC.title" content = "Pilot's Guide to Aircraft Insurance">

NOT:

<meta name = "DC.title" content = "A Pilot's Guide to Aircraft Insurance">

BUT:

<meta name = "DC.title" content = "AOPA urges mayor to reopen Albuquerque airport">

NOT:

<meta name = "DC.title" content = "AOPA's Tips on Buying Used Aircraft">
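
Where it is unclear which of several forms is the title, the guideline on repeating TITLE could be applied roughly as follows. The second variant is a hypothetical alternative form invented for this sketch.

```html
<!-- First iteration: the form judged most likely to be the title -->
<meta name = "DC.title" content = "Pilot's Guide to Aircraft Insurance">
<!-- Second iteration: a hypothetical variant form of the same title -->
<meta name = "DC.title" content = "Aircraft Insurance: a Pilot's Guide">
```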

4.2. Author or Creator

Label: CREATOR

Element Description: The person(s) or organization(s) primarily responsible for the intellectual content of the resource. For example, authors in the case of written documents, artists, photographers, or illustrators in the case of visual resources.

Guidelines for creation of content:

CREATORs should be listed separately in the same order that they appear in the publication. Personal names should be inverted, or listed surname first, followed by forename (or first name), unless the name is in a language where surname normally appears first (some Asian languages, for instance). When in doubt, give the name as it appears, and do not invert.

Examples:

<meta name = "DC.creator" content = "Duncan, Phyllis-Anne">

BUT:

<meta name = "DC.creator" content = "Park Sung Hee">
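
When a resource has more than one creator, listing them separately might look like the sketch below, with one element per name in the order given on the publication. The second name is a placeholder, not a real co-author.

```html
<!-- One CREATOR element per person, in publication order -->
<meta name = "DC.creator" content = "Duncan, Phyllis-Anne">
<!-- Placeholder second creator, for illustration only -->
<meta name = "DC.creator" content = "Smith, John">
```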

In the case of organizations where there is clearly a hierarchy present, list the parts of the hierarchy in this order: largest, followed by smallest, and separate them by periods.

Examples:

<meta name = "DC.creator" content = "Federal Aviation Administration. Aviation Safety Program.">

NOT:

<meta name = "DC.creator" content = "Aviation Safety Program of the Federal Aviation Administration.">

If it is not clear whether there is a hierarchy present, or unclear which is the larger or smaller portion of the body, give the name as it appears in the item.

Do not repeat the name of the organization responsible for the publication in the PUBLISHER area. If the nature of the responsibility is ambiguous, use PUBLISHER in preference to CREATOR in the case of organizations, and the reverse in the case of persons. In cases of lesser responsibility, other than creation, use CONTRIBUTORS.

4.3. Subject and Keywords

Label: SUBJECT

Element Description: The topic of the resource, or keywords or phrases that describe the subject or content of the resource. The intent of the specification of this element is to promote the use of controlled vocabularies and keywords. This element might well include scheme-qualified classification data (for example, Library of Congress Classification Numbers or Dewey Decimal numbers) or scheme-qualified controlled vocabularies (such as Medical Subject Headings or Art and Architecture Thesaurus descriptors) as well.

Guidelines for creation of content:

Select subject keywords from either the TITLE or DESCRIPTION information. Avoid using words not present in either area, or equivalent terms (if another term is preferred to the one used in the description, use that, but do not use both a preferred and non-preferred term that mean the same thing). If the subject of the item is a person or an organization, use the same form of the name as you would if the person or organization were a CREATOR, but do not repeat the name in the CREATOR element.

In general, choose the most significant and unique words for keywords, avoiding those too general to describe a particular item.

If the Library of Congress Subject Heading appropriate to the item is known, it can be used with the scheme=LCSH.

Example:

<meta name = "DC.subject" content = "(scheme=LCSH) Aircraft leasing and renting">

{or in HTML 4.0 <meta name = "DC.subject" scheme="LCSH" content = "Aircraft leasing and renting">}

4.4. Description

Label: DESCRIPTION

Element Description: A textual description of the content of the resource, including abstracts in the case of document-like objects or content descriptions in the case of visual resources. Future metadata collections might well include computational content description (spectral analysis of a visual resource, for example) that may not be embeddable in current network systems. In such a case this field might contain a link to such a description rather than the description itself.

Guidelines for creation of content:

If there is no abstract or other structured description available, descriptive information can be taken from the item itself. If a DESCRIPTION cannot be found in the introductory or front matter, or in the first few paragraphs, compose one "on the fly." DESCRIPTION should normally be limited to a few brief sentences, but care should be taken to provide useful vocabulary within them.

Example:

<meta name = "DC.description" content = "Illustrated guide to airport markings and lighting signals, with particular reference to SMGCS (Surface Movement Guidance and Control System) for airports with low visibility conditions.">

4.5. Publisher

Label: PUBLISHER

Element Description: The entity responsible for making the resource available in its present form, such as a publisher, a university department, or a corporate entity. The intent of specifying this field is to identify the entity that provides access to the resource.

Guidelines for content creation:

Do not repeat the name of the organization responsible for the publication in the CREATOR area. If the nature of the responsibility is ambiguous, use PUBLISHER in preference to CREATOR in the case of organizations, and the reverse in the case of persons.
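
A minimal sketch of these guidelines, assuming a hypothetical document written by a named person and issued by an organization: the person is recorded as CREATOR, and the organization appears once, as PUBLISHER only. The pairing of these two names is invented for illustration.

```html
<!-- Person with clear intellectual responsibility -->
<meta name = "DC.creator" content = "Duncan, Phyllis-Anne">
<!-- Organization that makes the resource available; not repeated as CREATOR -->
<meta name = "DC.publisher" content = "Federal Aviation Administration">
```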

4.6. Other Contributors

Label: CONTRIBUTORS

Element Description: Person(s) or organization(s) in addition to those specified in the CREATOR element who have made significant intellectual contributions to the resource but whose contribution is secondary to the individuals or entities specified in the CREATOR element (for example, editors, transcribers, illustrators, and convenors). (See CREATOR.)

Guideline for content creation:

The same general guidelines for using names of persons or organizations as CREATORs apply here.
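
As a sketch only, an editor or illustrator might be recorded as shown below. The element name "DC.contributors" is an assumption that follows the label given in this section, and the personal name is a placeholder.

```html
<!-- Placeholder name, inverted surname-first as for CREATOR -->
<meta name = "DC.contributors" content = "Smith, Jane">
```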

4.7. Date

Label: DATE

Element Description: The date the resource was made available in its present form. The recommended best practice is an 8-digit number in the form YYYYMMDD as defined by ANSI X3.30-1985. In this scheme, the date element for the day this is written would be 19961203, or December 3, 1996. Many other schemes are possible, but if used, they should be identified in an unambiguous manner.

Guidelines for content creation:

Dates are entered as written on the document or source. If none is available, use other cues from the document or leave this element blank.
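
Using the date cited in the element description, a DATE element in the recommended YYYYMMDD form might be coded as follows. This is a sketch of the recommended form, not the only acceptable one.

```html
<!-- December 3, 1996 in YYYYMMDD form -->
<meta name = "DC.date" content = "19961203">
```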

{need an example here?}

4.8. Resource Type

Label: TYPE

Element Description: The category of the resource, such as home page, novel, poem, working paper, preprint, technical report, essay, dictionary. It is expected that RESOURCE TYPE will be chosen from an enumerated list of types. A preliminary set of such types resides at: http://sunsite.Berkeley.EDU/Metadata/types.html

Guidelines for content creation:

Use a type from the list above. {Perhaps some examples?}

4.9. Format

Label: FORMAT

Element Description: The data representation of the resource, such as text/html, ASCII, Postscript file, executable application, or JPEG image. The intent of specifying this element is to provide information necessary to allow people or machines to make decisions about the usability of the encoded data (what hardware and software might be required to display or execute it, for example). As with RESOURCE TYPE, FORMAT will be assigned from enumerated lists such as registered Internet Media Types (MIME types). In principle, formats can include physical media such as books, serials, or other non-electronic media.

Guidelines for content creation:

[Link to list of Mime Types?]
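
Assuming FORMAT values are drawn from the registered Internet Media Types mentioned in the element description, an HTML page and a GIF image might be described as in this sketch. Both values are registered media types; a given resource would carry whichever one applies.

```html
<!-- An HTML document -->
<meta name = "DC.format" content = "text/html">
<!-- A GIF image -->
<meta name = "DC.format" content = "image/gif">
```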

4.10. Resource Identifier

Label: IDENTIFIER

Element Description: String or number used to uniquely identify the resource. Examples for networked resources include URLs and URNs (when implemented). Other globally-unique identifiers, such as International Standard Book Numbers (ISBN) or other formal names would also be candidates for this element.

Guidelines for content creation:

For URLs use:

<meta name = "DC.identifier" content = "(scheme=URL) http://purl.oclc.org/metadata/dublin_core/">

For all other identifying numbers on the document, use the element without a qualifier.
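
For an identifier other than a URL, the unqualified form might look like the sketch below. The number shown is a deliberately fake placeholder, not a real ISBN, and prefixing it with the text "ISBN" is an assumption made here for readability.

```html
<!-- Placeholder number for illustration; substitute the identifier printed on the document -->
<meta name = "DC.identifier" content = "ISBN 0-00-000000-0">
```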

{Give example?}

4.11. Source

Label: SOURCE

Element Description: Information about a second resource from which this resource is derived. This element may contain a date, format, identifier, or other information pertaining to the second resource. This element is not applicable for a resource that appears in its original form. It may be desirable to have a separate metadata package for the second resource; in that case, use of the Relation element is recommended.

Guidelines for content creation:

Use SOURCE when the item is from a larger entity, such as an article from a numbered issue of a journal or newsletter. Include the numbering and date of the source document, without qualifier.

Example:

<meta name = "DC.source" content = "Editor's Runway, FAA Aviation News, May/June 1997">

4.12. Language

Label: LANGUAGE

Element Description: Language(s) of the intellectual content of the resource. Where practical, the content of this field should coincide with the *Z39.53 three character codes for written languages.

See: http://www.sil.org/sgml/nisoLang3-1994.html

Guidelines for content creation:

{need to give examples}

*May need to incorporate the definition in the current RFC, i.e., "Where practical, the content of this field should coincide with RFC 1766 [3]; examples include en, de, es, fi, fr, ja, th, and zh."
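
Pending a final decision on the code list, a resource written in English might be described in either of the two ways sketched here. Which form to prefer is left open by this draft, so treat both as illustrations rather than recommendations.

```html
<!-- Three-character code, as in the Z39.53 list -->
<meta name = "DC.language" content = "eng">
<!-- Two-character code, as in RFC 1766 -->
<meta name = "DC.language" content = "en">
```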

4.13. Relation

Label: RELATION

Element Description: An identifier of a second resource and its relationship to this resource. This element is a means of linking separate metadata packages of related resources. Examples include a translation of a work, a chapter of a book, or a mechanical transformation. For the sake of interoperability, relationships should be selected from an enumerated list that is currently under development in the workshop series.

Guidelines for content creation:

{Need examples? Can we provide any - check out D. Bearman's comments}

4.14. Coverage

Label: COVERAGE

Element Description: The spatial and/or temporal characteristics of the intellectual content of the resource. Any date in this element is concerned with what the resource is about rather than when it was created or made available, the latter belonging in the Date element. Formal specification of Coverage is currently under development.

Guidelines for content creation:

{Will need to either link or incorporate work done to date by Mary L. and Coverage WG}

4.15. Rights Management

Label: RIGHTS

Element Description: The content of this element is intended to be a link (a URL or other suitable URI as appropriate) to a copyright notice, a rights-management statement, or perhaps a server that would provide such information in a dynamic way. The intent of specifying this field is to allow providers a means to associate terms and conditions or copyright statements with a resource or collection of resources. No assumptions should be made by users if such a field is empty or not present.

Guidelines for content creation:

At present, this element is used only for a free-text message indicating access limitations at a particular Web site.

<meta name = "DC.rights" content = "Available to AOPA members only">

  1. Glossary

  2. Background reading and References

rev. 30jan98 dih