Andrew Waugh
andrew.waugh@cmis.csiro.au
Resource Discovery Unit
DSTC
CSIRO Mathematics and Information Sciences

A Strawman Proposal for Metadata Extensibility

1. The Problem

Extensibility means the ability to change or augment metadata standards, either 'officially' (by changing the standard) or 'unofficially' (to support local extensions).

Extensions range from adding a new metadata element to creating an entirely new metadata set.

Extensibility causes problems for both the human users of the metadata system and the software that supports them.

The human creating metadata first needs to be notified that the standard has changed, and then needs to be able to find out what the changes mean. What values, for example, are legal in the new metadata element? The human viewing metadata needs to be able to find out what the information entered into the changed metadata elements actually means.

Note that the human issues of extensibility are a special case of the general problem of understanding metadata. In current systems, for example, how does a new creator discover what metadata elements are available for use and what data should be entered into them?

For software designers, the issue is how to reconfigure the software used by metadata creation tools, metadata storage engines, and metadata display programs to take into account the changed metadata standard. For a widely distributed system this reconfiguration must be automatic (or as close to automatic as possible). The economic cost of manual reconfiguration is simply prohibitive, even if it were practicable.

Further, the ability to easily configure metadata software to handle different metadata sets opens the possibility of metadata software which can easily handle multiple metadata sets (even ones not yet defined). This reduces the cost of the development of metadata systems and assists in the introduction of such systems.

2. A Strawman Solution

The essence of metadata extensibility is the ability to electronically publish a metadata schema. This allows metadata creators to understand what information they should add to a set of metadata, and metadata users to understand the meaning of the metadata sets they receive. Representing this schema in a machine processable form allows software to automatically adjust to new metadata elements.

The following strawman proposal is based on the schema specification system used in X.500. Experience has shown that this system is very extensible, although with certain limitations. The use of HTML to publish the schema is designed to address these limitations.

A mechanism for linking a reference description of a metadata schema has been proposed in [1]. This proposal is simply an extension of this schema idea to address the problems of extensibility.

This proposal is intended to broadly indicate how a metadata schema system would work and what information it would contain. The examples are not intended as syntax proposals. Nor is there any suggestion that this is all that the schema would contain.

3. Metadata Schema Publishing

The basic approach is to include in each metadata set a URI pointing to the metadata schema specification for that set. The specification can be viewed as a more formal version of the document proposed in [1], so the same linkage mechanism could be used:

<LINK REL = SCHEMA.dc HREF = "http://purl.org/metadata/dublin_core">

which states that a set of Web pages describing the schema used to construct this metadata set can be found at the specified URL. (A URN would be preferable, but in the absence of a deployed URN scheme, URLs will work.)
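As a rough illustration of how software might locate the schema, the following Python sketch scans an HTML page for LINK elements whose REL value starts with "SCHEMA." and records the URL each one points to. The class and function names are invented for this example, and treating the text after "SCHEMA." as a prefix that identifies the metadata set is an assumption, not part of the proposal.

```python
from html.parser import HTMLParser


class SchemaLinkParser(HTMLParser):
    """Collect links of the form <LINK REL=SCHEMA.prefix HREF=...>."""

    def __init__(self):
        super().__init__()
        self.schemas = {}  # prefix -> schema URL

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names, but not values.
        if tag != "link":
            return
        attributes = dict(attrs)
        rel = attributes.get("rel") or ""
        href = attributes.get("href")
        if rel.lower().startswith("schema.") and href:
            prefix = rel[len("schema."):]
            self.schemas[prefix] = href


def find_schema_links(html_text):
    """Return a dict mapping each schema prefix to its schema URL."""
    parser = SchemaLinkParser()
    parser.feed(html_text)
    parser.close()
    return parser.schemas
```

A metadata tool could call find_schema_links on a retrieved page and then fetch the returned URL to obtain the schema pages.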

The structure of the pages pointed to by the metadataSchema URL is shown in Figure 1.

The root page describes the metadata set as a whole, and includes lists of metadata elements (attributes) which may and must be contained in the set. Each metadata element has a page which describes the purpose of that element, and links to pages which describe the syntax and semantics of the element.
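The page structure described above could be mirrored in software roughly as follows. This is only a sketch: the class names, field names, and the check method are hypothetical, and the proposal does not prescribe any particular in-memory representation.

```python
from dataclasses import dataclass, field


@dataclass
class ElementSchema:
    """One metadata element page (all field names are illustrative)."""
    name: str
    description: str
    syntax_url: str     # link to the page describing the element's syntax
    semantics_url: str  # link to the page describing the element's semantics


@dataclass
class MetadataSetSchema:
    """The root page: the elements a set must and may contain."""
    name: str
    mandatory: dict = field(default_factory=dict)  # element name -> ElementSchema
    optional: dict = field(default_factory=dict)   # element name -> ElementSchema

    def check(self, record):
        """Return a list of problems in a record (element name -> value)."""
        problems = []
        for name in self.mandatory:
            if name not in record:
                problems.append(f"missing mandatory element: {name}")
        for name in record:
            if name not in self.mandatory and name not in self.optional:
                problems.append(f"unknown element: {name}")
        return problems
```

Because the lists of mandatory and optional elements are read from the published schema, a tool built this way would not need to be rebuilt when an element is added to the set.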

4. Metadata Set Schema

The page describing the Metadata Set would contain the following information

The Dublin Core metadata set might be described in the following example:

Dublin Core Metadata Element Set Version 2 Date 15 January 1997

The Dublin Core Metadata Element Set is intended to describe document-like objects. This includes documents, maps, and programs.

Mandatory Elements:

Optional Elements:

The Dublin Core Metadata Element Set is standardised in http://purl.org/metadata/dublin_core_elements

5. Metadata Element Schema

A page describing a metadata element would contain the following information

An example might be:

Language

Language(s) of the intellectual content of the resource. Where practical, the content of this field should coincide with the Z39.53 three character codes for written languages.

Syntax: String
Semantics: Z39.53
Standard: http://purl.org/metadata/dublin_core_elements

The metadata element schema has been separated from the page describing the metadata set as a whole to allow individual metadata elements to be referenced by multiple metadata set specifications.

6. Metadata Element Syntax Schema

This portion of the schema specifies syntactic limitations on metadata values. For example, a metadata value may be limited to 'upper case strings'.

A syntax instructs the user of the metadata set (human or machine) how metadata values are represented. This allows values to be consistent and aids machine processing and display of previously unrecognised metadata elements.

At a basic level, a syntax is equivalent to a type in a programming language. The basic syntaxes are strings, integers, floating point numbers and booleans. Although most (all?) current metadata elements would fall into one of these four basic syntaxes, it is likely that more complicated, structured, syntaxes will be used in future (e.g. a metadata element might contain a sequence of a string, an integer, and then a further string).

The syntax may also specify limitations on the values over and above the broad limitations of strings, integers, floats, and booleans. A string might have a limit on the maximum number of characters or on the range of characters.
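To make the idea of a syntax as a constrained type concrete, here is a minimal Python sketch. The Syntax class, its field names, and the way constraints are encoded are assumptions made for illustration; only the four basic syntaxes and the two kinds of string limit come from the text above.

```python
from dataclasses import dataclass
from typing import Optional

# The four basic syntaxes named in the text, mapped to Python types.
BASE_TYPES = {"string": str, "integer": int, "float": float, "boolean": bool}


@dataclass(frozen=True)
class Syntax:
    base: str                            # one of the keys of BASE_TYPES
    max_length: Optional[int] = None     # limit on the number of characters
    allowed_chars: Optional[str] = None  # limit on the range of characters

    def accepts(self, value):
        """Return True if the value conforms to this syntax."""
        expected = BASE_TYPES[self.base]
        # bool is a subclass of int in Python, so compare exact types.
        if type(value) is not expected:
            return False
        if self.base == "string":
            if self.max_length is not None and len(value) > self.max_length:
                return False
            if self.allowed_chars is not None and any(
                ch not in self.allowed_chars for ch in value
            ):
                return False
        return True
```

For example, an 'upper case string' syntax could be written as Syntax("string", allowed_chars="ABCDEFGHIJKLMNOPQRSTUVWXYZ").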

A page describing a metadata element syntax would contain the following information

An example might be:

Telephone Number Syntax

A telephone number is a string of up to 32 characters which complies with ITU recommendation E.123. Such a string has the following format:

The telephone number may contain spaces. An example telephone number is '+61 3 9282 2615'

Two telephone numbers are the same if they have the same digits in the same order. Spaces are not significant.
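The length limit and the matching rule in this example could be expressed in software along the following lines. The function names are invented for this sketch. It compares only the digits, so it also ignores the leading '+' and any other punctuation; that goes slightly beyond the statement that spaces are not significant and should be read as an assumption. It does not attempt to check the E.123 format itself.

```python
MAX_LENGTH = 32  # maximum number of characters, from the syntax description
DIGITS = "0123456789"


def within_length_limit(number):
    """Return True if the string is no longer than the syntax allows."""
    return len(number) <= MAX_LENGTH


def digits_of(number):
    """Return only the digits of a telephone number, in their original order."""
    return "".join(ch for ch in number if ch in DIGITS)


def same_telephone_number(a, b):
    """Two numbers match if they have the same digits in the same order."""
    return digits_of(a) == digits_of(b)
```

A display or search program that had never seen this element before could retrieve its syntax page and still compare values correctly.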

The metadata element syntax has been separated from the metadata element specification itself so that many metadata elements may share the same syntax (e.g. fax machines are labeled by telephone numbers).

There is a close relation between the syntax and semantics of a metadata element; both can control the values that can be placed in the metadata element. The syntax is intended to be used to specify mechanical rules controlling values (e.g. all characters must be in upper case). The semantics are intended to control values by specifying meaning (e.g. taxonomies). However, there is a large overlap and one or other might be more appropriate in a particular case.

7. Metadata Element Semantic Schema

This portion of the schema specifies semantic limitations on metadata values. For example, the values might be limited to Library of Congress subject headings.

A metadata element may have multiple semantic schemes (and so the actual metadata element needs to contain a qualifier specifying which scheme is actually used).
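A qualifier of this kind might be used by software as sketched below. The scheme labels, the table of vocabularies, and the function name are hypothetical, and each vocabulary is a tiny illustrative subset rather than the full list a real scheme page would define.

```python
# Hypothetical table of semantic schemes for a language element.
# Each vocabulary is a small illustrative subset, not the full scheme.
SCHEMES = {
    "Z39.53": {"eng", "fre", "ger"},
    "ISO639-1": {"en", "fr", "de"},
}


def is_valid(scheme, value):
    """Check a value against the scheme named by its qualifier."""
    vocabulary = SCHEMES.get(scheme)
    if vocabulary is None:
        raise KeyError(f"unknown semantic scheme: {scheme}")
    return value in vocabulary
```

The same value can be acceptable under one scheme and not under another, which is why the qualifier has to travel with the metadata element.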

A page describing metadata element semantics would contain the following information:

The metadata element semantics have been separated from the metadata element specification itself so that many metadata elements may share the same semantics.

8. Bootstrapping

A set of pages describing a metadata scheme is useless if creators of metadata sets do not know the URI of the page to include in the metadata set. Possible methods of bootstrapping this knowledge include:

9. Use of the Web Pages

The Web pages describing the metadata schema can be used in two ways:

References

[1] A Proposed Convention for Embedding Metadata in HTML, Reported by Stuart Weibel, June 1996
at <URL:http://www.oclc.org:5046/~weibel/html-meta.html>