
A Definition of Thesauri and Classification as Indexing Tools

Please be aware that this document is no longer current. For further information, please contact the author of this document or, if you have questions about the subject of this document, go to AskDCMI for an answer from one of our experts.

Date Issued:
Not Applicable
Is Replaced By:
Not Applicable
Latest version:
Status of document:
This is a DCMI Note.
Description of document:

This note summarizes the definition of thesauri and classification as indexing tools and gives the distinction between entity (taxonomic) and bibliographic classification. It also explains the difference between thesauri and classification and the differences between the general classification schemes suggested for the DCMES element SUBJECT: LCC, DDC and UDC.

This message summarizes the definition of thesauri and classification as indexing tools and gives the distinction between entity (taxonomic) and bibliographic classification. It also explains the difference between thesauri and classification and the differences between the general classification schemes suggested for the DCMES element SUBJECT: LCC, DDC and UDC.

In my experience, and this ongoing discussion is the proof, the presumption that the terminology used within DCMES and DCMES qualifiers would be clearly understood holds for only part of the DC community. Those coming from a library and information studies background have no problem understanding the metadata 'jargon'; the rest, however, are struggling to put some meaning behind it.

I happen to teach indexing languages and do my research on one classification, and I don't think it is fair that people are struggling, checking general dictionaries to reach some consensus, while the answer is fairly straightforward. I apologize if, in trying to clarify the ongoing discussion, I repeat concepts that are already well understood and accepted.

I will try to summarize the terminology used in DCMES that has LIS (library and information studies) provenance and is interpreted differently in that community from what others might expect.

I will concentrate here on the terminology in the DCMES qualifiers for the SUBJECT element that refers to encoding schemes such as thesauri and classifications. The examples given in DCMES have their origin in librarianship and documentation. The meaning of these terms is very strict and is not meant to be interpreted freely or understood in a general sense (in which, e.g., every systematic ordering can be called a classification and any dictionary that establishes relationships between terms can be called a thesaurus).

It is critical to understand that thesauri and classifications are used here in the specific sense of indexing languages. This sense is well defined in ISO 5963, which describes recommended procedures for examining documents, determining their subjects and selecting appropriate indexing terms. ISO 5963 is related to both ISO 2788 (guidelines for the establishment and development of monolingual thesauri) and ISO 5964 (the corresponding guidelines for multilingual thesauri).

'Indexing' is understood here as the act of describing or identifying a document in terms of subject content (and NOT as a description of the document as a physical entity).

An indexing language (as with every language) consists of vocabulary and syntax rules. We very often refer to them as 'systems' like classification systems or subject headings systems.

Generally speaking, an 'indexing term' (i.e. the representation of a concept that expresses a theme in a document) should be taken from:

a) dictionaries or encyclopaedias recognized as authorities in their field

b) thesauri

c) classification schemes

There are two major types of indexing languages:

a) those using natural language terms or words (thesauri, subject headings systems)

b) those using symbols: numbers, letters or a combination of those (bibliographic classification)

NB: The difference between these is not in the display, as thesauri, when properly designed, will have both an alphabetical listing and a systematic or classified display. Furthermore, every bibliographic classification has established relations between concepts, so relationships between concepts are not particular to thesauri. Also, every bibliographic classification provides some sort of alphabetical subject index, and sometimes even a thesaurus, to enable concept location. There is also an indexing tool called a "classaurus", which is a hybrid of both (see Bhattacharyya, Classaurus: its fundamentals, design and use, 1982).

In practical terms, therefore, the most obvious distinction is what you actually use as the indexing term: symbols or words (other differences are out of the scope of this discussion).

Classification is, it appears, more confusing for non-librarians. There are many classification schemes, and there are also many classifications of knowledge, which can be confused with bibliographic or documentary classifications because both can cover a specific field of knowledge or the whole system of knowledge. However, there are some reasons to limit the qualifiers in the DCMES SUBJECT field to bibliographic classification only.

The main difference between classification of knowledge and bibliographic classification is that bibliographic/documentary classifications, whether special (such as NLM, INSPEC etc.) or general (such as Dewey Decimal Classification, Library of Congress Classification and Universal Decimal Classification), are indexing systems designed to deal with knowledge recorded in documents. They are usually able to express not just knowledge, but also the form in which that knowledge is recorded, the language in which it is presented and many features particular to the instantiation of the subject within some document-like object.

Knowledge classification can be, and often is, TAXONOMIC (sometimes called 'entity classification'), like the classification of zoology, the classification of plants, or the classification of chemical elements. This means that such a classification lists each concept in one place only in its structure.

Bibliographic classifications, i.e. those one has to use to describe real documents, ARE NOT and CANNOT be taxonomic. They are always ASPECT or disciplinary classifications. This means that they will list one concept in all disciplines and fields where that concept might be studied: e.g. 'water' will have to appear under chemistry, physics, geology, medicine, sport etc.
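The contrast between taxonomic and aspect classification can be sketched as a pair of data structures. This is only an illustration in Python: the taxonomic path and all notations (such as `CHEM-01`) are invented placeholders, not taken from DDC, UDC, LCC or any other real scheme, and `classes_for` is a hypothetical helper.

```python
# Taxonomic (entity) classification: each concept has exactly one place.
# The path below is an invented placeholder.
taxonomic = {
    "water": "substances > compounds > water",
}

# Aspect (disciplinary) classification: the same concept is listed under
# every discipline in which it may be studied. Notations are hypothetical.
aspect = {
    "water": {
        "chemistry": "CHEM-01",
        "physics": "PHYS-04",
        "geology": "GEOL-07",
        "medicine": "MED-12",
        "sport": "SPORT-03",
    },
}


def classes_for(concept, discipline=None):
    """Return the notations for a concept, optionally for one discipline."""
    places = aspect.get(concept, {})
    if discipline is None:
        return sorted(places.values())
    return [places[discipline]] if discipline in places else []


# A document on water as a geological resource is indexed under the
# geology aspect only, which records the context of study.
geology_classes = classes_for("water", "geology")
all_classes = classes_for("water")
```

Choosing one notation out of several for the same concept is what records the context in which the document treats that concept.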

This is of critical importance for information retrieval as aspect classification helps to establish the context in which one concept or phenomenon might be studied within the document.

General bibliographic classifications (there are special ones as well) are existing systems with big vocabularies (DDC has over 20,000 terms, UDC 61,000, LCC several hundred thousand). These systems make provision for describing not only the subject, but also the form in which it is presented, the time and place with which the subject is connected, the language in which it is presented in the document, the physical quality of the carrier etc.

Some of these classification schemes have a hierarchical structure (like DDC and UDC); some list both simple and compound concepts and basically enumerate all the subjects expected to be studied in documents (like LCC). Some, however, tend to have faceted features, i.e. to be synthetic, like UDC, enabling the expression of an infinite number of subject combinations in documents. These classification systems are widely accepted, used in hundreds of countries and translated into many languages; that is the reason they are suggested in DCMES.
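The difference between an enumerative and a synthetic scheme can be illustrated with a small sketch. Everything here is an assumption made for illustration: the notations are invented, the function names are hypothetical, and the colon connector only loosely imitates the way a synthetic scheme such as UDC joins related subjects.

```python
# Enumerative scheme: compound subjects must be listed in advance.
# Notations are invented placeholders.
ENUMERATED = {
    "chemistry": "A1",
    "medicine": "B2",
    "medical chemistry": "B2.5",
}

# Synthetic (faceted) scheme: only simple concepts are listed; compound
# subjects are built by the indexer when needed.
SIMPLE = {"chemistry": "C", "medicine": "M", "sport": "S"}


def enumerate_subject(subject):
    """Look up a ready-made class; None if the scheme did not foresee it."""
    return ENUMERATED.get(subject)


def synthesize(*concepts, connector=":"):
    """Combine simple notations into a compound notation."""
    return connector.join(SIMPLE[c] for c in concepts)


foreseen = enumerate_subject("medical chemistry")
unforeseen = enumerate_subject("chemistry of sport")
built = synthesize("chemistry", "sport")
```

In the enumerative table a subject nobody predicted has no class, while the synthetic approach can still express it from the simple concepts.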

They are maintained and developed by institutions or bodies that own the publishing and translation rights. "Maintained" means that new concepts are constantly being added to follow the growth of knowledge. The three above-mentioned schemes are available in electronic form as well.

As they use symbols rather than words, they are especially suitable for the multilingual environment of the Internet. They can be used as the basis for developing thesauri or for building and tailoring a list of indexing terms for one's specific purposes. They can be used to describe any object, not just textual ones. A classification such as UDC can be designed for, and suitable for, information retrieval and not only shelf arrangement. In countries other than the USA, classification is used both for classified catalogues and for shelf arrangement. While in the American tradition it has always been regarded as a "mark and park" tool (both DDC and LCC are designed to serve that purpose), most European countries have a rich tradition of using classification as a language-independent indexing tool.
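Why a symbolic notation suits a multilingual environment can be shown with a short sketch. It assumes a tiny invented table of notations with captions in three languages; the notations `N1` and `N2`, the sample records and both function names are hypothetical.

```python
# Each notation carries captions in several languages.
# Notations and records are invented for illustration.
CAPTIONS = {
    "N1": {"en": "water", "fr": "eau", "de": "Wasser"},
    "N2": {"en": "sport", "fr": "sport", "de": "Sport"},
}

# Documents are indexed once, with the notation rather than with a word.
RECORDS = [
    {"title": "Hydrologie generale", "subject": "N1"},
    {"title": "Water chemistry", "subject": "N1"},
    {"title": "Sportmedizin", "subject": "N2"},
]


def notation_for(term):
    """Find the notation whose caption in any language matches the term."""
    wanted = term.casefold()
    for notation, captions in CAPTIONS.items():
        if any(c.casefold() == wanted for c in captions.values()):
            return notation
    return None


def retrieve(term):
    """Return titles indexed under the notation that the term maps to."""
    notation = notation_for(term)
    if notation is None:
        return []
    return [r["title"] for r in RECORDS if r["subject"] == notation]


english_hits = retrieve("water")
french_hits = retrieve("eau")
```

A search term in any of the languages resolves to the same notation, so the same documents are found whatever the language of the searcher or of the document.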