DCMI DCSV: A syntax for representing simple structured data in a text string
Creator: | Simon Cox |
---|---|
Creator: | Renato Iannella |
Contributor: | Andy Powell |
Contributor: | Andrew Wilson |
Contributor: | Pete Johnston |
Contributor: | Thomas Baker |
Date Issued: | 2006-02-13 |
Identifier: | http://dublincore.org/specifications/dublin-core/dcmi-dcsv/2006-02-13/ |
Replaces: | http://dublincore.org/specifications/dublin-core/dcmi-dcsv/2005-07-25/ |
Is Replaced By: | http://dublincore.org/specifications/dublin-core/dcmi-dcsv/2006-04-10/ |
Latest version: | http://dublincore.org/specifications/dublin-core/dcmi-dcsv/ |
Status of document: | This is a DCMI Proposed Recommendation. |
Description of document: | This document describes a method for recording simple structured data in a text string, or structured value string. This method is referred to for historical reasons as DCSV (which originally meant "Dublin Core™ Structured Value"). |
Revision note: | 2006-02-13. After approval of the DCMI Abstract Model [DAM] as a DCMI Recommendation in March 2005, the DCMI Usage Board undertook a review of the DCSV syntax specification and of the related specifications for the encoding schemes DCMI Box, DCMI Point, and DCMI Period, with the goal of revising their language for conformance with the Abstract Model. A summary of the changes made can be found in the document "Revision of legacy DCSV specifications". As of 2005, the DCMI Abstract Model supports the representation of complex structures, such as those encoded in DCSV-syntax-based encoding schemes, as "related descriptions". The DCMI Usage Board encourages implementers to consider the longer-term consequences for interoperability of packaging structured information in parsable DCSV-encoded string values as opposed to conveying that information in related descriptions using other syntax encodings. |
Table of Contents
- Introduction
- Structured Value Strings
- Parsing DCSV
- Sample Code for parsing DCSV-encoded values
- Glossary
- Acknowledgements
- References
1. Introduction
It is often desirable to encode or _serialise_ simple structured data within a text string. Some generic methods are in common use. Borrowing conventions from natural languages, commas (,) and semi-colons (;) are frequently used as list separators. Similarly, comma-separated values (CSV) and tab-separated values (TSV) are common export formats from spreadsheet and database software, with _line feeds_ separating rows or tuples. Dots (.) and dashes (-) are sometimes used to imply hierarchies, particularly in thesaurus applications. The eXtensible Markup Language [XML] provides a more general solution, using tags contained within angle brackets (<, >) to indicate structure.
2. Structured Value Strings
This document describes a particular method for encoding simple structured data within a value string. In the DCMI Abstract Model [DAM], a value string is defined as "a simple string that represents the value of a property". Value strings encoded according to the method described in this document are referred to here as structured value strings.
(Note that for historical reasons, the method itself is still referred to here as the DCSV Syntax, or DCSV. "DCSV" originally stood for "Dublin Core™ Structured Value", a legacy concept from circa 1997 which no longer has a place in today's DCMI Abstract Model [DAM].)
The DCSV Syntax allows a _structured value string_ to be parsed into a set of components. To represent this set of components, the syntax distinguishes between two types of substring within the structured value string: _componentLabels_ and _componentValues_. A _componentLabel_ is the name of a _component_ within the structured data, and a _componentValue_ is the data itself.
Punctuation characters are used in encoding a structured value string as follows:
- equal signs (=) separate the componentLabel from the componentValue;
- semicolons (;) separate (optionally labelled) components within a list;
- dots (.) may be used within componentLabels to indicate hierarchical or containment relationships between components.
The componentLabels and the componentValues themselves each consist of a text string. The intention is that the _componentLabel_ will be a word or code corresponding to the name of the component. The componentLabels may be absent, in which case the entire substring delimited by semi-colons (;) or the end of the string comprises a componentValue.
The following patterns show how structured information about a resource may be recorded in strings using DCSV:

    "u1; u2; u3"
    "cA=v1"
    "cA=v1; cB.part1=v2; cB.part2=v3"
    "cA=v1; u2; u3"

where
- u1, u2 and u3 are componentValues of unlabelled components,
- cA and cB are the componentLabels for components,
- part1 and part2 are componentLabels for components that are sub-components of the component with the componentLabel cB, and
- v1, v2, and v3 are componentValues of labelled components.
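
The dot convention in componentLabels can be read as a path into a nested structure. The following Python sketch illustrates one possible reading. It assumes the string has already been split into (componentLabel, componentValue) pairs; the function name `nest_components` and the choice to collect unlabelled components under an empty key are hypothetical and not part of this specification. It does not handle a label that is used both as a leaf and as a parent.

```python
def nest_components(pairs):
    """Build a nested dict from (componentLabel, componentValue) pairs.

    Dots in a label are treated as hierarchy separators.
    Unlabelled components (empty label) are collected in a list under ''.
    """
    root = {}
    for label, value in pairs:
        if label == "":
            root.setdefault("", []).append(value)
            continue
        *parents, leaf = label.split(".")
        node = root
        for name in parents:
            # Descend into (or create) the container for each parent label.
            node = node.setdefault(name, {})
        node[leaf] = value
    return root


# Pairs corresponding to the pattern "cA=v1; cB.part1=v2; cB.part2=v3"
example = [("cA", "v1"), ("cB.part1", "v2"), ("cB.part2", "v3")]
nested = nest_components(example)
print(nested)  # {'cA': 'v1', 'cB': {'part1': 'v2', 'part2': 'v3'}}
```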
The use of specific punctuation characters in DCSV-encoded _value strings_ means that care must be exercised if these characters are to be used directly within strings which comprise the content (either _componentLabels_ or _componentValues_) of the components. For DCSV, therefore, when an equal sign (=) or a semicolon (;) is required within the componentValue, the characters are escaped using a backslash, appearing as \= and \; respectively. There should be no ambiguity regarding the dot, full-stop, or period (.) within strings: when it is part of a _componentLabel_, a dot indicates some hierarchy; when part of a componentValue, it has the conventional meaning for the context. This method of escaping special characters largely preserves readability and the ability to enter DCSV-encoded metadata value strings easily using a text editor if required. Software written to process DCSV-encoded value strings must make the necessary substitutions.
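
As an illustration of the escaping rule, the following Python sketch builds a DCSV string from a list of (componentLabel, componentValue) pairs. The function names `escape_dcsv` and `encode_dcsv` are hypothetical. Escaping the backslash itself is an assumption made here so that the encoding can be reversed unambiguously; the text above names only the equal sign and the semicolon. Labels are assumed to contain none of the special characters.

```python
def escape_dcsv(text):
    """Backslash-escape the characters that DCSV uses as punctuation."""
    # Assumption: the backslash is escaped first so that it can be told
    # apart from the escapes added for = and ; afterwards.
    return text.replace("\\", "\\\\").replace("=", "\\=").replace(";", "\\;")


def encode_dcsv(components):
    """Join (componentLabel, componentValue) pairs into one DCSV string.

    An empty label produces an unlabelled component.
    """
    parts = []
    for label, value in components:
        escaped = escape_dcsv(value)
        parts.append(f"{label}={escaped}" if label else escaped)
    return "; ".join(parts)


encoded = encode_dcsv([("name", "a=b; c"), ("", "plain")])
print(encoded)  # name=a\=b\; c; plain
```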
Note that DCSV is only intended to be used for relatively simple structured information about resources.
3. Parsing DCSV
A simple method can be used to parse metadata value strings encoded according to the DCSV syntax. For a single DCSV-encoded value string:
- split the value string into a list of substrings on any unescaped semicolons (;); if no semicolon is present, there is a single substring;
- split each substring into a componentLabel-componentValue pair on any unescaped equal signs (=); if no equal sign (=) is present, the _componentLabel_ is empty;
- within each componentValue, replace the escaped characters with the actual character required.
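
The three steps above can be sketched in Python as follows. This is an illustrative reading, not a normative implementation: the function names are hypothetical, the split on the equal sign is taken at the first unescaped occurrence only, and trimming of surrounding whitespace is an assumption that the steps themselves do not mention.

```python
import re


def split_unescaped(text, delimiter, maxsplit=-1):
    """Split text on delimiter characters not preceded by a backslash escape."""
    parts, current, i = [], [], 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            # Keep the escape sequence intact; it is resolved in unescape().
            current.append(text[i:i + 2])
            i += 2
        elif ch == delimiter and maxsplit != 0:
            parts.append("".join(current))
            current = []
            maxsplit -= 1
            i += 1
        else:
            current.append(ch)
            i += 1
    parts.append("".join(current))
    return parts


def unescape(text):
    """Replace each backslash-escaped character with the character itself."""
    return re.sub(r"\\(.)", r"\1", text, flags=re.DOTALL)


def parse_dcsv(value_string):
    """Return a list of (componentLabel, componentValue) pairs."""
    components = []
    for substring in split_unescaped(value_string, ";"):
        pieces = split_unescaped(substring, "=", maxsplit=1)
        if len(pieces) == 2:
            label, value = pieces
        else:
            label, value = "", pieces[0]  # no equal sign: empty label
        # Assumption: surrounding whitespace is not significant.
        components.append((label.strip(), unescape(value.strip())))
    return components


print(parse_dcsv("cA=v1; u2; u3"))
# [('cA', 'v1'), ('', 'u2'), ('', 'u3')]
```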
4. Sample Code for parsing DCSV-encoded values
The following Perl program reads a DCSV-encoded string entered on stdin and prints a formatted version of the structured result. This code is provided for demonstration purposes only and contains no error-checking.

    #!/usr/local/bin/perl
    use strict;

    print "Enter string to be parsed:\n";
    my $string = join('', <STDIN>);
    # remove the trailing newline left by the input
    chomp $string;
    print "\nString to be parsed is [$string]\n";

    # First escape % characters
    $string =~ s/%/"%".unpack('C',"%")."%"/eg;
    # Next change \ escaped characters to %d% where d is the character's ASCII code
    $string =~ s/\\(.)/"%".unpack('C',$1)."%"/eg;
    print "\nEscaped string is [$string]\n";

    # Now split the string into components
    my @components = split(/;/, $string);
    print "\nComponents:\n";
    foreach my $component (@components) {
        my ($label, $value) = split(/=/, $component, 2);
        # if there is no = copy contents of $label into $value and empty $label
        if (!defined $value) { $value = $label; $label = ''; }
        # strip whitespace from label string
        $label =~ s/^\s*(\S+)\s*$/$1/;
        # convert % escaped characters back in label string
        $label =~ s/%(\d+)%/pack('C',$1)/eg;
        # convert % escaped characters back in value string
        $value =~ s/%(\d+)%/pack('C',$1)/eg;
        print "Component Label [$label] has Component Value [$value]\n";
    }

5. Glossary
This document uses the following terms:
- component
- One of a set of parts that comprise a structured value string.
- componentLabel
- A label given to a component.
- componentValue
- Data contained in the component.
- structured value
- Structured value is a term that was loosely employed in the DCMI context between 1997 and 2005 to designate a variety of multi-component entities used as "attribute values" in metadata. Strings encoded according to the syntax described in this specification were called "Dublin Core™ Structured Values", hence the acronym "DCSV".
- structured value string
- A value string that contains machine-parsable component parts (and which has an associated syntax encoding scheme indicating how the component parts are encoded within the string).
- syntax encoding scheme
- A syntax encoding scheme indicates that the value string is formatted in accordance with a formal notation, such as "2000-01-01" as the standard expression of a date.
- value
- A value is the physical or conceptual entity that is associated with a property when it is used to describe a resource.
- value representation
- A value representation is a surrogate for (i.e., a representation of) the value.
- value string
- A value string is a simple string that represents the value of a property.
6. Acknowledgements
John Kunze encouraged the original authors to write up their proposal formally, resulting in the first DCSV specification of July 2000. Kim Covil wrote the Perl code. Eric Miller nagged regarding overlap with XML. Steve Tolkin convinced the original authors to switch to =.
7. References
[DAM]
A. Powell, M. Nilsson, A. Naeve, P. Johnston, 2005, DCMI Abstract Model
http://dublincore.org/specifications/dublin-core/abstract-model/.
[XML]
Extensible Markup Language
http://www.w3.org/XML/.