
Title: On Information Factoring in Dublin Metadata Records
Creator: Sperberg-McQueen, C. M.
Date Issued: 1996-04-17
Identifier:
Replaces: NA
Is Replaced By: NA
Latest version: http://dublincore.org/documents/info-factoring/
Status of document: This is a DCMI Note.
Description of document:
This document describes some problems in the interpretation of metadata records which contain repeated fields, repeated field groups, or references to other metadata records. The semantics of repeated fields and groups (and, equivalently, of references to other metadata records) are described using sentential logic, and a proposal is made to specify the interpretation of repeating groups using the disjunctive normal form of corresponding logical expressions. From this proposal, requirements for grouping elements and inheritance are derived. The semantic principles involved may be of wider applicability, but all examples are from the so-called `Dublin Core' of metadata elements, described in the paper OCLC/NCSA Metadata Workshop Report, by Stuart Weibel, Jean Godby, Eric Miller, and Ron Daniel, available on the World Wide Web at http://www.oclc.org:5046/conferences/metadata/dublin_core_report.html.
This document, first published in 1996, is being made available as a Dublin Core discussion note as part of the DC-Architecture Working Group's effort to formalise an XML/RDF representation of the Dublin Core. While the document is 5 years old, many of the issues and observations made are worth reconsidering in the light of subsequent work on RDF and XML. The DC Note in its current form is UNPUBLISHED and undergoing minor edits for publication on the dublincore.org site. Contact Dan Brickley (dc-architecture co-chair) if you have any queries regarding this process.
The remainder of this document is unaltered apart from minor edits for XHTML validation.
The data elements defined for a metadata record by the `Dublin Core' are all optional and all repeatable, and have no prescribed order. Some (e.g. author, title) relate to the intellectual content of an object (the work), while others (e.g. form) relate to particular realizations or instantiations of that intellectual content. Some (e.g. identifier, terms and conditions) may apply to all forms taken by a given item, or only to some forms and not others.
For example, consider the documentation for the TEI Lite SGML tag set. As a work, it may be described by the following metadata:
How should, could, or must metadata for such items be represented?
At the Warwick meeting, Dan LaLiberte argued that in Dublin it was agreed that a given metadata record should describe only a single realization of an intellectual object; this would help ensure that metadata records are unambiguous. I don't find this explicit in the Dublin conference report, but that report does say explicitly that multiple versions may require multiple records. Redundancy may be controlled by factoring common information (e.g. work-related information) into separate records and `inheriting' it in the records for specific realizations. On this view, the three instantiations of the TEI Lite documentation will each require a separate metadata record.
Reports at the Warwick meeting (April, 1996) from users of the Dublin Core, however, make clear that in practice, there is a strong desire to put metadata for a given work in a single record, using some mechanism such as repeating groups to describe multiple realizations. This paper, for example, might be represented thus with repeating groups (I use the DTD described by Eric Miller's paper Issues of Document Description in HTML, available at http://www.oclc.org:5046/~emiller/tmp/paper.html):
<citation>
<title>TEI Lite: An Introduction to
Text Encoding for Interchange</title>
<author>Lou Burnard</author>
<author>C. M. Sperberg-McQueen</author>
<form>TEI Lite</form>
<identifier scheme='URL'>
http://www-tei.uic.edu/orgs/tei/intros/teiu5.tei
</identifier>
<form>HTML</form>
<identifier scheme='URL'>
http://www-tei.uic.edu/orgs/tei/intros/teiu5.html
</identifier>
<identifier scheme='URL'>
http://www-tei.uic.edu/orgs/tei/intros/teiu5.split.html
</identifier>
</citation>
The only problem with this method is that it requires a lot of intelligence in the reader or user of the metadata to interpret the meaning of fields which occur more than once. A human may easily realize that the first form (TEI Lite) applies only to the first identifier, and that the second and third identifiers are for objects in the second form (HTML); software will realize it only if suitably instructed. A human will realize, perhaps even without conscious thought, that the two <author> elements both apply, at the same time, to all instantiations of the paper, because there are two authors for the paper, while the two <form> elements each relate to separate and distinct instantiations of the paper. Software is unlikely to realize this critical difference without help.
The association of form and identifier information can be made explicit, using the <instance> element of Eric Miller's DTD:
<citation>
<title>TEI Lite: An Introduction to
Text Encoding for Interchange</title>
<author>Lou Burnard</author>
<author>C. M. Sperberg-McQueen</author>
<instance>
<form>TEI Lite</form>
<identifier scheme='URL'>
http://www-tei.uic.edu/orgs/tei/intros/teiu5.tei
</identifier>
</instance>
<instance>
<form>HTML</form>
<identifier scheme='URL'>
http://www-tei.uic.edu/orgs/tei/intros/teiu5.html
</identifier>
<identifier scheme='URL'>
http://www-tei.uic.edu/orgs/tei/intros/teiu5.split.html
</identifier>
</instance>
</citation>
This is an improvement, but not a full solution (note, for now, that the two HTML identifiers still require a different interpretation from the two <author> elements).
If common information is factored out into other records, we may be able to escape some of these logical difficulties, but we need a clear explanation of how information in the local record and the inherited information imported from an external record are to be related: are they always additive, even if the same field appears in both records? Or does a local field `override' the inherited value for that field?
We can get a better grip on the problem if we apply some principles of formal logic. The simplest way to formalize the semantics of a Dublin metadata record, it seems to me, is using sentential logic. Existential quantifiers may also be used, and I describe that possibility briefly, enough to persuade myself that the more complex formalism does not require a more complex syntax for metadata records. Either approach allows us first to express more clearly the types of ambiguity arising from repeated fields or groups, and second to see what sorts of mechanisms might suffice to disambiguate them.
Let us consider first why a simple record like the following seems less problematic than the sample given above:
<citation>
<title>On the Pulse of Morning
<author>Maya Angelou
<publisher>University of Virginia Library Electronic Text Center
<otherAgent>Transcribed by the University of Virginia Electronic Text Center
<date>1993
<object>Poem
<form>1 ASCII file
<source>Newspaper stories and oral performance of text at the presidential inauguration of Bill Clinton
<language>English
</citation>
The key difference, I believe, is that all of the metadata in this record unambiguously applies all the time, while some elements of the previous record apply only in conjunction with certain other elements.
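If we express each element as a logical proposition, the simple record has a correspondingly simple logical form. For convenience, let us give each proposition a short name:
- T: the title is On the Pulse of Morning
- A: the author is Maya Angelou
- P: the publisher is the University of Virginia Library Electronic Text Center
- D: the date is 1993
- OA: the other agent is the University of Virginia Electronic Text Center (as transcriber)
- Ob: the object is a poem
- F: the form is 1 ASCII file
- S: the source is newspaper stories and the oral performance of the text
- L: the language is English
The record as a whole then has the form
(T & A & P & D & OA & Ob & F & S & L), or "The item has the title
On the Pulse of Morning and the item was written
by Maya Angelou and ...".
The more complex record has a more complex logical structure. If we name the propositions thus:
- T: the title is TEI Lite: ...
- A1: the author is Lou Burnard
- A2: the author is C. M. Sperberg-McQueen
- F1: the form is TEI Lite
- I1: the URL is .../teiu5.tei
- F2: the form is HTML
- I2: the URL is .../teiu5.html
- I3: the URL is .../teiu5.split.html
then the record has the form
(T & A1 & A2 & ((F1 & I1)
| (F2 & (I2 | I3)))), which can be paraphrased in
English roughly thus: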
- The item is called "TEI Lite: ..."
- and it was written by Lou Burnard
- and it was (also) written by C. M. Sperberg-McQueen
- and it is
  - either in TEI Lite as .../teiu5.tei
  - or in HTML
    - either as .../teiu5.html
    - or as .../teiu5.split.html.
Each individual instance can be described (as Dan LaLiberte points out) with a simple metadata record, which translates into a simple formula:
T & A1 & A2 & F1 & I1
T & A1 & A2 & F2 & I2
T & A1 & A2 & F2 & I3
I believe that this simple form of expression, in which the
only connector is and, corresponds to the class of
metadata records which are unambiguous and easy to interpret.
The problem of interpreting complex metadata records (ones with
repeating fields or groups) can thus be paraphrased: how do we
derive a set of simple and-expressions from the
logical expression representing a complex metadata record?
Fortunately, the answer is simple.
If we combine the three simple expressions into a single formula, we get a paraphrase of the metadata record as a whole:
( (T & A1 & A2 & F1 & I1)
| (T & A1 & A2 & F2 & I2)
| (T & A1 & A2 & F2 & I3) )
which can be paraphrased roughly thus:
(if you have an item in hand described by this metadata record, then one of these three things is true:)
- either the title is TEI Lite ... and the author(s) are LB and CMSMcQ and the form is TEI Lite and the URL is .../teiu5.tei
- or the title is TEI Lite ... and the author(s) are LB and CMSMcQ and the form is HTML and the URL is .../teiu5.html
- or the title is TEI Lite ... and the author(s) are LB and CMSMcQ and the form is HTML and the URL is .../teiu5.split.html
The salient point (and the only interesting or new claim in
this entire paper) is that this expression is logically
equivalent to the original formula for the example, but unlike
the original this one is in disjunctive normal form.[1] It is fortunately
not hard to generate the disjunctive normal form of arbitrary
logical expressions, particularly when (as here) the only
operators allowed are and and or.
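The expansion into disjunctive normal form can be sketched in a few lines of Python. This is a minimal illustration, not part of the original proposal: the tuple encoding of expressions and the function name `dnf` are assumptions made for the example, and only the operators and and or are handled.

```python
from itertools import product

# An expression is either a string (an atomic proposition such as "T")
# or a tuple of the form ("and", e1, e2, ...) or ("or", e1, e2, ...).


def dnf(expr):
    """Return the disjunctive normal form of expr as a list of terms.

    Each term is a tuple of atomic propositions that are and-ed together;
    the terms in the returned list are or-ed together.
    """
    if isinstance(expr, str):
        return [(expr,)]
    op, *args = expr
    parts = [dnf(arg) for arg in args]
    if op == "or":
        # A disjunction simply pools the terms of its operands.
        return [term for part in parts for term in part]
    if op == "and":
        # A conjunction distributes over its operands: choose one term
        # from each operand and concatenate the choices.
        return [sum(choice, ()) for choice in product(*parts)]
    raise ValueError(f"unknown operator: {op!r}")


# The formula for the TEI Lite record, using the proposition names above.
record = ("and", "T", "A1", "A2",
          ("or",
           ("and", "F1", "I1"),
           ("and", "F2", ("or", "I2", "I3"))))

for term in dnf(record):
    print(" & ".join(term))
```

Run on the TEI Lite formula, the sketch prints the three simple and-expressions listed earlier, one per line.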
We can then describe the semantics of metadata records thus:
- A simple record means the and-ing together (conjunction) of its
sub-elements.
- A complex record means the or-ing together (disjunction) of several
simple records, each represented by one term in the complex
record's disjunctive normal form.
We do need, however, a way to make explicit not only the
parenthetical groupings in the formula
(<instance> does this) but also which
propositions in the formula are joined by and
(&) and which by or (|). We can see therefore
that proposals calling for a single grouping element (such as
that made by Eric Miller in the paper already mentioned, or by
myself in informal DTD sketches) will not suffice to solve the
problem. We need not one but two distinct types of group.
Miller's <citation> element already serves as an
and-group, since simple citations are interpreted
as the and-ing together (formally, the
conjunction) of their elements. It will have to be able to nest
recursively, however, if we want to handle all cases of shared
metadata. And we will need a second grouping element, to serve
as an or-group. For examples, see the section
Groups, below.
Some readers may resist the use of sentential logic as a formalism for representing the meanings of metadata records in general, since the meaning of
<citation>
<title>On the pulse of morning</title>
</citation>
is not, in general, merely "The title is On the pulse of morning" but something more like "(There is an object, described by this record, and) the title (of the object described by this record) is On the pulse of morning." That is, there is an implied existential quantifier inherent in the existence of a metadata record, and there is an implied argument for each metadata element, viewed as a logical function.
Paraphrasing records at this level of detail would make it easier to capture the semantics of work and realization more clearly. Represented in first-order predicate calculus, our example might look like this:
(E w)(E lb)(E cmsmcq)(E i1)(E i2)(E i3)
( work(w)
& title(w,"TEI Lite ...")
& name(lb,"Lou Burnard")
& name(cmsmcq,"C. M. Sperberg-McQueen")
& author(w,lb) & author(w,cmsmcq)
& instance(w,i1)
& form(i1,teilite)
& url(i1,".../teiu5.tei")
& instance(w,i2)
& form(i2,html)
& url(i2,".../teiu5.html")
& instance(w,i3)
& form(i3,html)
& url(i3,".../teiu5.split.html")
& (i1 != i2) & (i1 != i3)
)
which we might paraphrase as:
Note, in passing, that from the <form>
elements we can infer that the first instantiation is not
identical to the second or third, but the second and third
instantiations, both being in HTML, could conceivably be
identical. Hence there is no claim that (i2 !=
i3).
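One way to make the predicate-calculus reading concrete is to store each predicate as a tuple and query the resulting set. The sketch below is only an illustration under stated assumptions: the `distinct` fact name and the three helper functions are hypothetical, the existentially quantified variables are represented by plain string identifiers, and the URL for i3 follows the earlier paraphrase of the record (the split HTML file).

```python
# Each fact mirrors one predicate of the formula.
FACTS = {
    ("work", "w"),
    ("title", "w", "TEI Lite ..."),
    ("name", "lb", "Lou Burnard"),
    ("name", "cmsmcq", "C. M. Sperberg-McQueen"),
    ("author", "w", "lb"),
    ("author", "w", "cmsmcq"),
    ("instance", "w", "i1"),
    ("form", "i1", "teilite"),
    ("url", "i1", ".../teiu5.tei"),
    ("instance", "w", "i2"),
    ("form", "i2", "html"),
    ("url", "i2", ".../teiu5.html"),
    ("instance", "w", "i3"),
    ("form", "i3", "html"),
    ("url", "i3", ".../teiu5.split.html"),
    # Only the inequalities actually asserted in the formula.
    ("distinct", "i1", "i2"),
    ("distinct", "i1", "i3"),
}


def instances_of(work):
    """List the instantiations asserted for a work."""
    return sorted(f[2] for f in FACTS if f[0] == "instance" and f[1] == work)


def known_distinct(a, b):
    """True only if the facts explicitly assert that a and b differ."""
    return ("distinct", a, b) in FACTS or ("distinct", b, a) in FACTS


def describe(instance):
    """Collect the form and URL asserted for one instantiation."""
    return {f[0]: f[2] for f in FACTS
            if f[0] in ("form", "url") and f[1] == instance}


print(instances_of("w"))
print(known_distinct("i2", "i3"))
```

Nothing in the fact set says that i2 and i3 differ, so `known_distinct("i2", "i3")` is false, which mirrors the absence of any claim that (i2 != i3).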
If we are willing to assume that different instantiations
are the only possible causes of or-groups in
metadata records, then we may plausibly believe (a) that
complex metadata records can all be described with a single
and-group, if instantiations are given identifiers
(such as the i1, i2, i3 of the
example) and the identifiers are used to associate the metadata
elements applying to each instantiation, and (b) that Eric
Miller's <instance> element suffices, after all,
since all instances are implicitly or-ed with each
other, and nothing else will cause an or
group.
I'm reluctant to accept this logic, first because while many (all?) examples of logical complication in metadata records do involve multiple instantiations, I certainly haven't seen any argument that proves this is a logical necessity. Second, tempting though this argument is, I still don't know how to derive the formula just given systematically from the metadata record itself. The formula has three instantiations, and three form() predicates, while the metadata record itself has three <identifier> elements, but only two <instance> elements, and only two <form> elements.
We saw earlier, when we used sentential logic to say what
metadata records mean, that we need both a grouping element
meaning and and one meaning or. The
or-group we must invent. For now, let's call it
<or>. The and-group we already
have, in the citation element. The only drawback
is that the term citation seems to imply that its
contents constitute a complete citation, which will not always
be the case. For purposes of illustration, therefore, let's
invent a second new grouping element called
<and>.
If we augment Eric Miller's DTD with <or> and <and>, our example record will look like this (I augment the <or> and <and> elements with identifiers, so I can refer to them in later discussion):
<citation>
<title>TEI Lite: An Introduction to
Text Encoding for Interchange</title>
<author>Lou Burnard</author>
<author>C. M. Sperberg-McQueen</author>
<or id=O1>
<and id=A1>
<form>TEI Lite</form>
<identifier scheme='URL'>
http://www-tei.uic.edu/orgs/tei/intros/teiu5.tei
</identifier>
</and>
<and id=A2>
<form>HTML</form>
<or id=O2>
<identifier scheme='URL'>
http://www-tei.uic.edu/orgs/tei/intros/teiu5.html
</identifier>
<identifier scheme='URL'>
http://www-tei.uic.edu/orgs/tei/intros/teiu5.split.html
</identifier>
</or>
</and>
</or>
</citation>
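A record marked up this way can be expanded mechanically into its simple records. The following Python sketch makes several assumptions that are not in the original: the SGML is rewritten as well-formed XML (quoted attribute values, explicit end tags) so that the standard-library parser accepts it, <citation> is treated as an and-group, and the function name `simple_records` is invented for the example.

```python
import xml.etree.ElementTree as ET
from itertools import product

RECORD = """
<citation>
  <title>TEI Lite: An Introduction to Text Encoding for Interchange</title>
  <author>Lou Burnard</author>
  <author>C. M. Sperberg-McQueen</author>
  <or id="O1">
    <and id="A1">
      <form>TEI Lite</form>
      <identifier scheme="URL">http://www-tei.uic.edu/orgs/tei/intros/teiu5.tei</identifier>
    </and>
    <and id="A2">
      <form>HTML</form>
      <or id="O2">
        <identifier scheme="URL">http://www-tei.uic.edu/orgs/tei/intros/teiu5.html</identifier>
        <identifier scheme="URL">http://www-tei.uic.edu/orgs/tei/intros/teiu5.split.html</identifier>
      </or>
    </and>
  </or>
</citation>
"""


def simple_records(elem):
    """Expand an element into its simple records (the terms of the DNF).

    Each simple record is a tuple of (element name, value) pairs.
    """
    if elem.tag == "or":
        # An or-group pools the simple records of its children.
        return [rec for child in elem for rec in simple_records(child)]
    if elem.tag in ("and", "citation"):
        # An and-group combines one simple record from each child.
        parts = [simple_records(child) for child in elem]
        return [sum(choice, ()) for choice in product(*parts)]
    # Any other element is a single metadata field.
    value = " ".join((elem.text or "").split())
    return [((elem.tag, value),)]


records = simple_records(ET.fromstring(RECORD))
for rec in records:
    print([value for _, value in rec])
```

Each printed list corresponds to one term of the disjunctive normal form: title, both authors, one form, and one identifier.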
It might be suggested (I suggested it myself, in the first draft of these notes) that we don't really need the <or> element everywhere it occurs in the example just given. It would be clear enough simply to write
<and id=A2>
<form>HTML</form>
<identifier scheme='URL'>
http://www-tei.uic.edu/orgs/tei/intros/teiu5.html
</identifier>
<identifier scheme='URL'>
http://www-tei.uic.edu/orgs/tei/intros/teiu5.split.html
</identifier>
</and>
since the URL is a characteristic of the realization, not the
work, and in general two identifiers in the same scheme will
always refer to distinct realizations. They thus might
as well be regarded as forming a sort of automatic, implicit
or-group.
On this view, some elements are implicitly
and-ed together when they repeat: author, for
example. Some elements (e.g. identifier, form) are
intrinsically incapable of being and-ed together
and thus form implicit or-groups. Some elements
can go either way: multiple titles may all apply, or they might
apply each to one particular instantiation of the work (the
French title applies to the French version, the English title
to the English version of a European regulation, which might
need juridically to be treated as a single work, since all
national-language versions have equal authority).
On the whole, it seems better not to make too much of this generalization -- though it might be a useful heuristic for plausibility checking. Since some elements can go either way, we will need <and> and <or> (or rather, their logical equivalents: I am not proposing actual elements here, just pointing out the need for elements with conjunctive and disjunctive meaning) regardless, and using such elements explicitly seems simpler and less confusing than hard-wiring so much intelligence into software.
It also is worth pointing out that the initial premise of this section is false: two URLs may very easily point to the same object, and it is easy to imagine methods of describing formats which would allow multiple names to be applied to the same format (just as in some programming languages the same data type can be referred to by multiple names).
There are three ways to treat inheritance of metadata from
other records. We can insist that the inherited metadata never
include the same elements as are present locally, or we can
specify that locally specified elements override inherited
elements of the same name, or we can attempt to specify some
method of merging the two records so as to keep all the
information from both records, by and-ing or
or-ing corresponding elements together. In the
first event, it may be overkill to speak of `inheritance'; in
the latter, we may be reintroducing all the problems of
repeating groups.
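The three treatments can be compared side by side in a short Python sketch. Everything here is an assumption made for illustration: records are modelled as dictionaries from element name to a list of values, the policy names `disjoint`, `override`, and `merge` are invented labels for the three approaches, and the sample values are hypothetical.

```python
def inherit(local, inherited, policy):
    """Combine a local record with an inherited one under a given policy."""
    shared = sorted(set(local) & set(inherited))
    if policy == "disjoint":
        # First approach: the same element may not occur in both records.
        if shared:
            raise ValueError(f"elements present in both records: {shared}")
        return {**inherited, **local}
    if policy == "override":
        # Second approach: a local element replaces the inherited one.
        return {**inherited, **local}
    if policy == "merge":
        # Third approach: keep the values from both records. Whether the
        # pooled values are and-ed or or-ed is left open here, which is
        # exactly the repeating-group problem all over again.
        merged = {name: list(values) for name, values in inherited.items()}
        for name, values in local.items():
            merged.setdefault(name, []).extend(values)
        return merged
    raise ValueError(f"unknown policy: {policy!r}")


# Hypothetical records: a work-level record and a realization-level record.
work = {"title": ["English title"], "author": ["Lou Burnard"]}
realization = {"title": ["French title"], "form": ["HTML"]}

print(inherit(realization, work, "override"))
print(inherit(realization, work, "merge"))
```

Under `override` the realization keeps only its own title; under `merge` both titles survive, and the record again needs a rule saying how the two are related.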
If we take the first or the second approach (or even the
third approach, as long as we provide a simple rule, such as
"All inherited data is and-ed together with local
data"), we will be able to interpret a reference to external
metadata fairly rigorously:
By making liberal use of references to other records, we can do without <and> and <or> elements. We can demonstrate this by giving a method of transforming records with <and> and <or> into sets of records linked by reference. For each SGML element in the source record, we do the following:
Or perhaps it would be clearer to put it this way: We begin by copying the entire <citation> element and all its children into a new record, which we then process as follows:
The sample record from the previous section would turn into the following set of records:
Record (at http://www.meta.org/catalog/c):
<citation>
<title>TEI Lite: An Introduction to
Text Encoding for Interchange</title>
<author>Lou Burnard</author>
<author>C. M. Sperberg-McQueen</author>
</citation>
Record (at http://www.meta.org/catalog/a1):
<citation id=A1>
<relation scheme='URL' type='OtherType' othertype='inherits'>
http://www.meta.org/catalog/c
</relation>
<form>TEI Lite</form>
<identifier scheme='URL'>
http://www-tei.uic.edu/orgs/tei/intros/teiu5.tei
</identifier>
</citation>
Record (at http://www.meta.org/catalog/a2):
<citation id=A2>
<relation scheme='URL' type='OtherType' othertype='inherits'>
http://www.meta.org/catalog/c
</relation>
<form>HTML</form>
</citation>
Record (at http://www.meta.org/catalog/h1):
<citation>
<relation scheme='URL' type='OtherType' othertype='inherits'>
http://www.meta.org/catalog/a2
</relation>
<identifier scheme='URL'>
http://www-tei.uic.edu/orgs/tei/intros/teiu5.html
</identifier>
</citation>
Record (at http://www.meta.org/catalog/h2):
<citation>
<relation scheme='URL' type='OtherType' othertype='inherits'>
http://www.meta.org/catalog/a2
</relation>
<identifier scheme='URL'>
http://www-tei.uic.edu/orgs/tei/intros/teiu5.split.html
</identifier>
</citation>
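Reading such a set of linked records back into simple records amounts to following the inheritance references. The Python sketch below assumes the simple rule mentioned earlier, that all inherited data is and-ed together with local data; the in-memory `CATALOG` dictionary stands in for fetching the records by URL, and the function names are hypothetical.

```python
BASE = "http://www.meta.org/catalog/"
TEI = "http://www-tei.uic.edu/orgs/tei/intros/"

# Each entry: (URL of the record inherited from, or None; local fields).
CATALOG = {
    BASE + "c": (None, [
        ("title", "TEI Lite: An Introduction to Text Encoding for Interchange"),
        ("author", "Lou Burnard"),
        ("author", "C. M. Sperberg-McQueen"),
    ]),
    BASE + "a1": (BASE + "c", [("form", "TEI Lite"),
                               ("identifier", TEI + "teiu5.tei")]),
    BASE + "a2": (BASE + "c", [("form", "HTML")]),
    BASE + "h1": (BASE + "a2", [("identifier", TEI + "teiu5.html")]),
    BASE + "h2": (BASE + "a2", [("identifier", TEI + "teiu5.split.html")]),
}


def resolve(url, catalog=CATALOG):
    """Return all fields of a record: inherited fields first, then local."""
    chain, seen = [], set()
    while url is not None:
        if url in seen:
            raise ValueError(f"inheritance cycle at {url}")
        seen.add(url)
        parent, fields = catalog[url]
        chain.append(fields)
        url = parent
    # Walk from the most distant ancestor down to the record itself.
    return [field for fields in reversed(chain) for field in fields]


def leaf_records(catalog=CATALOG):
    """URLs of records that no other record inherits from."""
    parents = {parent for parent, _ in catalog.values() if parent is not None}
    return sorted(url for url in catalog if url not in parents)


for url in leaf_records():
    print(url, [value for _, value in resolve(url)])
```

The three leaf records (a1, h1, h2) resolve to the same three simple records obtained from the disjunctive normal form of the original record.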
More work is needed here, I think, both to specify how to interpret the record when the same element occurs both locally and in the referenced object, and to specify what constitutes the same element.
An adequate syntax for multiple versions (realizations) of
the same work (intellectual content) requires an explicit
semantic interpretation, to avoid hopeless ambiguity. If we
provide our syntax with mechanisms for both disjunctive and
conjunctive groupings (and-groups and
or-groups), we can provide simple rules for
interpreting complex records in terms of their disjunctive
normal form.
More complex semantic formalisms, using existential quantifiers, may also be defined, but do not require any syntax more elaborate than the simpler semantics.
[1] A formula in
sentential logic is in disjunctive normal form if it
is a disjunction (or alternation, or or-group) of
one or more terms, and if each term is a conjunction
of one or more primitive sentences or their negations. No
nested expressions are allowed. For a fuller discussion, any
book on formal logic may be consulted, but perhaps the best
discussion of disjunctive normal form and the algebraic
manipulations used to achieve it may be found in W.V. Quine,
Methods of Logic, 4th ed. (Cambridge, Mass.:
Harvard University Press, 1982).