DCMI DCSV
DCMI DCSV: A syntax for representing simple structured data in a text string
Creator: | Simon Cox |
---|---|
Creator: | Renato Iannella |
Contributor: | Andy Powell |
Contributor: | Andrew Wilson |
Contributor: | Pete Johnston |
Contributor: | Thomas Baker |
Date Issued: | 2006-04-10 |
Identifier: | http://dublincore.org/specifications/dublin-core/dcmi-dcsv/2006-04-10/ |
Replaces: | http://dublincore.org/specifications/dublin-core/dcmi-dcsv/2000-07-28/ |
Is Replaced By: | Not Applicable |
Latest version: | http://dublincore.org/specifications/dublin-core/dcmi-dcsv/ |
Status of document: | This is a DCMI Recommendation. |
Description of document: | This document describes a method for recording simple structured data in a text string, or structured value string. This method is referred to for historical reasons as DCSV (which originally meant "Dublin Core™ Structured Value"). |
Revision note: | 2006-04-10. After approval of the DCMI Abstract Model [DCAM] as a DCMI Recommendation in March 2005, the DCMI Usage Board undertook a review of the DCSV syntax specification and of the related specifications for the encoding schemes DCMI Box, DCMI Point, and DCMI Period, with the goal of revising their language for conformance with the Abstract Model. A summary of the changes made can be found in the document "Revision of DCSV specifications". As of 2005, the DCMI Abstract Model supports the construct "related description" as a method for describing value entities such as persons or, indeed, time periods or locations in space. The DCMI Usage Board encourages implementers to consider using related descriptions as an alternative to packaging descriptive information in DCSV-encoded strings. Descriptions based on the DCMI Abstract Model are more likely to be interoperable over the longer term than descriptions using DCSV-syntax-based specifications. |
Table of Contents
- Introduction
- Structured Value Strings
- DCSV syntax encoding schemes
- Parsing DCSV
- Sample Code for parsing DCSV-encoded values
- Glossary
- Acknowledgments
- References
1. Introduction
It is often desirable to encode or serialise simple structured data within a text string. Some generic methods are in common use. Borrowing conventions from natural languages, commas (,) and semi-colons (;) are frequently used as list separators. Similarly, comma-separated values (CSV) and tab-separated values (TSV) are common export formats from spreadsheet and database software, with line feeds separating rows or tuples. Dots (.) and dashes (-) are sometimes used to imply hierarchies, particularly in thesaurus applications. The eXtensible Markup Language [XML] provides a more general solution, using tags contained within angle brackets (<, >) to indicate structure.
This document describes a particular method for encoding simple structured data within a value string. In the DCMI Abstract Model [DCAM], a value string is defined as "a simple string that represents the value of a property". Value strings encoded according to the method described in this document are referred to here as structured value strings.
(Note that for historical reasons, the method itself is still referred to here as the DCSV Syntax, or DCSV. "DCSV" originally stood for "Dublin Core™ Structured Value", a legacy concept from circa 1997 which no longer has a place in today's DCMI Abstract Model [DCAM].)
As of 2006, when this specification was revised, the DCMI Usage Board encourages implementers to describe the value resource of a property more fully, where necessary, by making it the subject of an additional description. In terms of the DCMI Abstract Model, this means creating a "related description" within the context of a "description set".
Using a related description in this way places all of the information in a description set within the context of the DCMI Abstract Model, helping to ensure that recipients of the metadata will be able to parse and understand it. In contrast, the use of DCSV-encoded strings for the description of the value resource forces recipients of the metadata to understand both the DCMI Abstract Model and the DCSV specification described here.
Despite these limitations, implementers may want to use DCSV-encoded strings in situations where the chosen encoding syntax used by the application does not support related descriptions (e.g., XHTML) or where there is a significant legacy adoption of DCSV-encoded structured value strings within a community.
2. Structured Value Strings
The DCSV Syntax allows a structured value string to be parsed into a set of components. To represent this set of components, the syntax distinguishes between two types of substring within the structured value string -- componentLabels and componentValues. A componentLabel is the name of a component within the structured data, and a componentValue is the data itself.
Punctuation characters are used in encoding a structured value string as follows:
- equal signs (=) separate the componentLabel from the componentValue;
- semicolons (;) separate (optionally labelled) components within a list;
- dots (.) may be used within componentLabels to indicate hierarchical or containment relationships between components.
The componentLabels and the componentValues themselves each consist of a text string. The intention is that the componentLabel will be a word or code corresponding to the name of the component. The componentLabels may be absent, in which case the entire substring delimited by semi-colons (;) or the end of the string comprises a componentValue.
The following patterns show how structured information about a resource may be recorded in strings using DCSV:
    "u1; u2; u3"
    "cA=v1"
    "cA=v1; cB.part1=v2; cB.part2=v3"
    "cA=v1; u2; u3"
where
- u1, u2 and u3 are componentValues of unlabelled components,
- cA and cB are the componentLabels for components,
- part1 and part2 are componentLabels for components that are sub-components of the component with the componentLabel cB, and
- v1, v2, and v3 are componentValues of labelled components.
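As an illustration of how the dotted componentLabels in the third pattern might be interpreted, the following Python sketch turns (componentLabel, componentValue) pairs into a nested dictionary. The function name `nest_components` and the convention of collecting unlabelled componentValues under the empty-string key are assumptions made for this example; the specification does not prescribe any in-memory representation, and the sketch performs no error checking (for instance, it does not handle a label that is used both as a leaf and as a parent).

```python
def nest_components(pairs):
    """Build a nested dict from (componentLabel, componentValue) pairs.

    Dots in a label are read as containment, so "cB.part1" becomes
    result["cB"]["part1"]. Unlabelled values are collected in a list
    under the empty-string key (an illustrative convention only).
    """
    result = {}
    for label, value in pairs:
        if not label:
            result.setdefault("", []).append(value)
            continue
        node = result
        *parents, leaf = label.split(".")
        for name in parents:
            # Descend into (or create) the dict for each parent component.
            node = node.setdefault(name, {})
        node[leaf] = value
    return result


# Components of the pattern "cA=v1; cB.part1=v2; cB.part2=v3"
pairs = [("cA", "v1"), ("cB.part1", "v2"), ("cB.part2", "v3")]
print(nest_components(pairs))
# prints {'cA': 'v1', 'cB': {'part1': 'v2', 'part2': 'v3'}}
```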
The use of specific punctuation characters in DCSV-encoded value strings means that care must be exercised if these characters are to be used directly within strings which comprise the content (either componentLabels or componentValues) of the components. For DCSV, therefore, when an equal sign (=) or a semicolon (;) is required within the componentValue, the characters are escaped using a backslash, appearing as \= and \; respectively. There should be no ambiguity regarding the dot, full-stop, or period (.) within strings: when it is part of a componentLabel, a dot indicates some hierarchy; when part of a componentValue, it has the conventional meaning for the context. This method of escaping special characters largely preserves readability and the ability to enter DCSV-encoded metadata value strings easily using a text editor if required. Software written to process DCSV-encoded value strings must make the necessary substitutions.
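The substitutions described above might be sketched in Python as follows. The function names `escape_component` and `unescape_component` are hypothetical. Escaping the backslash itself is an assumption that goes beyond the text of this specification, which names only the equal sign and the semicolon; it is included so that a literal backslash survives a round trip.

```python
import re


def escape_component(text):
    """Prefix =, ; and (by assumption) backslash with a backslash."""
    return re.sub(r"([=;\\])", r"\\\1", text)


def unescape_component(text):
    """Replace each backslash-escaped character with the character itself."""
    return re.sub(r"\\(.)", r"\1", text, flags=re.DOTALL)


raw = "E=mc2; roughly"
encoded = escape_component(raw)
print(encoded)                      # prints E\=mc2\; roughly
print(unescape_component(encoded))  # prints E=mc2; roughly
```

Unescaping is applied only after a string has been split into components, so that escaped delimiters are not mistaken for real ones.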
Note that DCSV is only intended to be used for relatively simple structured information about resources.
3. DCSV syntax encoding schemes
Where DCSV-encoded structured value strings are used, this should be indicated by using a syntax encoding scheme. For example, the DCMI-endorsed DCMI Period encoding scheme should be used as follows in XHTML:
    <link rel="schema.DC" href="http://purl.org/dc/elements/1.1/" />
    <link rel="schema.DCTERMS" href="http://purl.org/dc/terms/" />
    <meta name="DC.coverage" scheme="DCTERMS.Period" content="name=The Great Depression; start=1929; end=1939;" />
Note that "DCSV" itself should not be used as a syntax encoding scheme. Implementers should use the DCSV specification to derive application-specific syntax encoding schemes where necessary.
Note also that as of 2006, for the reasons outlined above, it is unlikely that the DCMI Usage Board will endorse any new DCMI-maintained terms based on the DCSV specification.
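As a rough illustration of how an application might generate a meta element like the DCMI Period example in this section, the Python sketch below joins labelled components into a DCSV string and then HTML-escapes it for use as an attribute value. The helper names `dcsv_encode` and `period_meta` are hypothetical, and escaping the backslash character is an assumption not stated in this specification.

```python
from html import escape as html_escape


def dcsv_encode(components):
    """Join (label, value) pairs into a DCSV string, escaping values."""
    parts = []
    for label, value in components:
        # Escape the backslash first (an assumption), then = and ;.
        for ch in ("\\", "=", ";"):
            value = value.replace(ch, "\\" + ch)
        parts.append(f"{label}={value}" if label else value)
    return "; ".join(parts)


def period_meta(name, start, end):
    """Return an XHTML meta element carrying a DCMI Period value."""
    content = dcsv_encode([("name", name), ("start", start), ("end", end)])
    return ('<meta name="DC.coverage" scheme="DCTERMS.Period" '
            f'content="{html_escape(content, quote=True)}" />')


print(period_meta("The Great Depression", "1929", "1939"))
```

DCSV escaping is applied before HTML escaping, so a consumer that first decodes the attribute value and then parses the DCSV string recovers the original components.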
4. Parsing DCSV
A simple method can be used to parse metadata value strings encoded according to the DCSV syntax. For a single DCSV-encoded value string:
- split the value string into a list of substrings on any unescaped semicolons (;); if no semicolon is present, there is a single substring;
- split each substring into a componentLabel-componentValue pair on any unescaped equal signs (=); if no equal sign (=) is present, the componentLabel is empty;
- within each componentValue, replace the escaped characters with the actual character required.
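The three steps above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the names `parse_dcsv`, `_split_unescaped` and `_unescape` are hypothetical, and trimming surrounding whitespace and skipping empty components (such as the one after a trailing semicolon) are assumptions not stated in this specification.

```python
import re


def _split_unescaped(text, delimiter, maxsplit=0):
    """Split on delimiter, ignoring occurrences preceded by a backslash."""
    parts, current, count, i = [], [], 0, 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            current.append(text[i:i + 2])  # keep the escape pair intact
            i += 2
            continue
        if ch == delimiter and (maxsplit == 0 or count < maxsplit):
            parts.append("".join(current))
            current = []
            count += 1
        else:
            current.append(ch)
        i += 1
    parts.append("".join(current))
    return parts


def _unescape(text):
    """Replace each backslash-escaped character with the character itself."""
    return re.sub(r"\\(.)", r"\1", text, flags=re.DOTALL)


def parse_dcsv(value_string):
    """Return a list of (componentLabel, componentValue) tuples."""
    components = []
    for chunk in _split_unescaped(value_string, ";"):
        if not chunk.strip():
            continue  # assumption: ignore empty components
        pieces = _split_unescaped(chunk, "=", maxsplit=1)
        if len(pieces) == 2:
            label, value = pieces
        else:
            label, value = "", pieces[0]  # no "=": the label is empty
        components.append((_unescape(label.strip()), _unescape(value.strip())))
    return components


print(parse_dcsv("name=The Great Depression; start=1929; end=1939;"))
# prints [('name', 'The Great Depression'), ('start', '1929'), ('end', '1939')]
```

Splitting happens before unescaping, so an escaped semicolon or equal sign is never treated as a delimiter.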
5. Sample Code for parsing DCSV-encoded values
The following Perl program reads a DCSV-encoded string entered on stdin and prints a formatted version of the structured result. This code is provided for demonstration purposes only and contains no error-checking.
    #!/usr/local/bin/perl
    use strict;

    print "Enter string to be parsed:\n";
    my $string = join('', <STDIN>);
    print "\nString to be parsed is [$string]\n";

    # First escape % characters
    $string =~ s/%/"%".unpack('C',"%")."%"/eg;
    # Next change \ escaped characters to %d% where d is the character's ascii code
    $string =~ s/\\(.)/"%".unpack('C',$1)."%"/eg;
    print "\nEscaped string is [$string]\n";

    # Now split the string into components
    my @components = split(/;/, $string);

    print "\nComponents:\n";
    foreach my $component (@components) {
        my ($label, $value) = split(/=/, $component, 2);
        # if there is no = copy contents of $label into $value and empty $label
        # (test with defined() so that values such as "0" or "" are kept)
        if (!defined $value) {
            $value = $label;
            $label = '';
        }
        # strip whitespace from label string
        $label =~ s/^\s*(\S+)\s*$/$1/;
        # convert % escaped characters back in label string
        $label =~ s/%(\d+)%/pack('C',$1)/eg;
        # convert % escaped characters back in value string
        $value =~ s/%(\d+)%/pack('C',$1)/eg;
        print "Component Label [$label] has Component Value [$value]\n";
    }
6. Glossary
This document uses the following terms:
- component: One of a set of parts that comprise a structured value string.
- componentLabel: A label given to a component.
- componentValue: Data contained in the component.
- structured value: Structured value is a term that was loosely employed in the DCMI context between 1997 and 2005 to designate a variety of multi-component entities used as "attribute values" in metadata. Strings encoded according to the syntax described in this specification were called "Dublin Core™ Structured Values", hence the acronym "DCSV".
- structured value string: A value string that contains machine-parsable component parts (and which has an associated syntax encoding scheme indicating how the component parts are encoded within the string).
- syntax encoding scheme: A syntax encoding scheme indicates that the value string is formatted in accordance with a formal notation, such as "2000-01-01" as the standard expression of a date.
- value: A value is the physical or conceptual entity that is associated with a property when it is used to describe a resource.
- value representation: A value representation is a surrogate for (i.e., a representation of) the value.
- value string: A value string is a simple string that represents the value of a property.
7. Acknowledgments
John Kunze encouraged the original authors to write up their proposal formally, resulting in the first DCSV specification of July 2000. Kim Covil wrote the Perl code. Eric Miller nagged regarding overlap with XML. Steve Tolkin convinced the original authors to switch to =.
After approval of the DCMI Abstract Model [DCAM] as a DCMI Recommendation in March 2005, the DCMI Usage Board undertook a review of this DCSV syntax specification and of the related specifications for the encoding schemes DCMI Box, DCMI Point, and DCMI Period, with the goal of revising their language for conformance with the Abstract Model [REVIEW].
8. References
[DCAM] A. Powell, M. Nilsson, A. Naeve, P. Johnston, 2005. DCMI Abstract Model. http://dublincore.org/specifications/dublin-core/abstract-model/
[REVIEW] DCMI Usage Board. Revision of DCSV specifications. http://dublincore.org/usage/decisions/2006/2006-01.DCSV-revisions.html
[XML] Extensible Markup Language. http://www.w3.org/XML/