$Revision: 1.5 $ on $Date: 2006/07/31 12:50:49 $ by Alistair Miles.

All parts and revisions of this document can be obtained from http://isegserv.itd.rl.ac.uk/cvs-public/skos/press/dc2006/.

Please note that this document has been superseded, and is maintained for historical purposes only. The final (current) version of this paper is available from http://isegserv.itd.rl.ac.uk/public/skos/press/dc2006/camera-ready-paper.pdf (to be presented at the 2006 International Conference on Dublin Core and Metadata Applications).


SKOS: Requirements for Standardization

Appendix J. Thesaurus Representation Issues

This appendix describes specific issues relating to the representation of thesauri in RDF using the SKOS Core Vocabulary 2nd Public Working Draft edition [SKOSCoreGuide, SKOSCoreSpec]. Only "standard" thesauri are considered here, where "standard" means constructed and maintained in a manner that is consistent with ISO 2788:1986 [ISO2788] and/or BS 8723-2 [BS8723-2]. Because these standards have been subject to ambiguous interpretation, the analysis here is based on the model described in the following two paragraphs.

A "standard thesaurus" is constructed and managed as a set of "terms", where a "term" can be either "preferred" or "non-preferred". A "preferred term" is also known as a "descriptor", and a "non-preferred term" is also known as a "non-descriptor". The set of descriptors is also known as the "indexing vocabulary", because in a traditional card index only the descriptors may be used in the value of the "subject" field. The set of non-descriptors is also known as the "entry vocabulary", because each non-descriptor is linked to a descriptor, and therefore the non-descriptors help indexers and searchers find their way to the appropriate descriptor to use in indexing and searching.

The relationship between a non-descriptor and its corresponding descriptor is traditionally denoted by the imperative "USE" (sometimes abbreviated as "US"), and the inverse relationship denoted by "USED FOR" (abbreviated as "UF"). The descriptors may be organised into a generalisation hierarchy, and the generalisation and specialisation relationships are usually denoted by "broader" and "narrower" respectively (abbreviated as "BT" and "NT"). Associative links may also be made between descriptors, and these are denoted by "related" (abbreviated as "RT").

Below is an extract from a standard thesaurus, rendered in the traditional thesaurus style, showing a single descriptor and a single non-descriptor.

love UF affection

affection USE love


Each descriptor in a thesaurus establishes a distinct, unique meaning. Therefore, when creating an RDF/SKOS representation of a thesaurus, a resource of type skos:Concept is declared for each descriptor in the thesaurus. Each such resource should be allocated a URI. The actual lexical value of the descriptor is mapped to a value of the skos:prefLabel property in the appropriate language. The lexical value of any associated non-descriptors is mapped to values of the skos:altLabel and/or skos:hiddenLabel properties in the appropriate language.

For example, the extract above is represented in RDF/SKOS as follows (all RDF examples in this document are given in the Turtle serialisation syntax):

@prefix v: <http://www.example.com/vocab#>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.

v:A001 rdf:type skos:Concept;
  skos:prefLabel "love"@en;
  skos:altLabel "affection"@en.

N.B. In the example above a URI has been constructed by appending an arbitrarily chosen alphanumeric string ("A001" - the "local name") to a base URI ("http://www.example.com/vocab#" - the "namespace URI").

Whether or not the issues described below must be resolved during the standardization of SKOS depends on whether the ability to recover a traditional thesaurus-style representation from an RDF/SKOS representation is chosen to be a mandatory requirement.
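
The recovery requirement mentioned above can be made concrete with a minimal Python sketch. It assumes a hypothetical in-memory form of the RDF/SKOS data, with one dictionary per conceptual resource holding a "pref" label and a list of "alt" labels. These names are illustrative only and are not part of SKOS Core.

```python
def render_thesaurus(concepts):
    """Rebuild traditional thesaurus entries from concept records.

    Each record is a dict with a "pref" label (the descriptor) and a list
    of "alt" labels (the non-descriptors). This structure is an assumption
    made for illustration, not something defined by SKOS Core.
    """
    lines = []
    for concept in concepts:
        if not concept["alt"]:
            # A descriptor with no entry terms is listed on its own.
            lines.append(concept["pref"])
        for alt in concept["alt"]:
            # Descriptor entry: descriptor UF non-descriptor.
            lines.append(f'{concept["pref"]} UF {alt}')
            # Reciprocal entry: non-descriptor USE descriptor.
            lines.append(f'{alt} USE {concept["pref"]}')
    return lines


concepts = [{"pref": "love", "alt": ["affection"]}]
print("\n".join(render_thesaurus(concepts)))
```

For the love/affection example this reproduces the two entries of the extract given earlier. The issues below are cases where such a simple reconstruction is not possible.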

Annotations on Non-Descriptors

In some thesauri, annotations (i.e. documentation, notes) of various types may be associated with both descriptors and non-descriptors. Annotations that are associated with a descriptor may be mapped to RDF statements about the conceptual resource, for example the following thesaurus extract:

love
  UF affection
  SN A strong feeling of affection or attraction towards another person.

... can be represented in RDF/SKOS as:

@prefix v: <http://www.example.com/vocab#>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.

v:A001 rdf:type skos:Concept;
  skos:prefLabel "love"@en;
  skos:altLabel "affection"@en;
  skos:scopeNote "A strong feeling of affection or attraction towards another person."@en.

However, there is currently no support in the SKOS Core Vocabulary for annotations that are associated with a non-descriptor. For example, consider the following thesaurus extract:

grinding house
  UFO grinding mill
  SN A place where material is crushed.

grindery
  UFO grinding mill
  SN A place where metal objects are sharpened.

grinding mill
  USE grinding house OR grindery
  SN Use "grinding house" for a place material is crushed and "grindery" for a place 
     where metal objects are sharpened.

It is reasonable to create the following SKOS/RDF representation:

@prefix v: <http://www.example.com/vocab#>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.

v:C001 rdf:type skos:Concept;
  skos:prefLabel "grinding house"@en;
  skos:altLabel "grinding mill"@en;
  skos:scopeNote "A place where material is crushed."@en.

v:C002 rdf:type skos:Concept;
  skos:prefLabel "grindery"@en;
  skos:altLabel "grinding mill"@en;
  skos:scopeNote "A place where metal objects are sharpened."@en.

... however the scope note associated with the non-descriptor grinding mill has been lost, and therefore the original thesaurus-style representation cannot be recovered from the RDF/SKOS representation.
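
The loss can be illustrated with a short Python sketch of the mapping just described. The term records (with "term", "use" and "note" keys) are a hypothetical layout chosen for the example. The function returns both the resulting concept records and the notes for which the representation has no place.

```python
def to_concepts(terms):
    """Map thesaurus terms to concept records, following the pattern above.

    Each term is a dict with "term", "use" (list of descriptors, empty for
    a descriptor) and "note" keys. This layout is hypothetical.
    Returns the concept records and the notes that could not be carried over.
    """
    concepts = {
        t["term"]: {"prefLabel": t["term"], "altLabel": [], "scopeNote": t["note"]}
        for t in terms
        if not t["use"]
    }
    dropped = []
    for t in terms:
        if t["use"]:
            for descriptor in t["use"]:
                concepts[descriptor]["altLabel"].append(t["term"])
            if t["note"]:
                # The note belongs to the label, not to either concept.
                dropped.append((t["term"], t["note"]))
    return concepts, dropped


terms = [
    {"term": "grinding house", "use": [], "note": "A place where material is crushed."},
    {"term": "grindery", "use": [], "note": "A place where metal objects are sharpened."},
    {"term": "grinding mill", "use": ["grinding house", "grindery"],
     "note": "Use 'grinding house' for a place material is crushed and "
             "'grindery' for a place where metal objects are sharpened."},
]
concepts, dropped = to_concepts(terms)
print(dropped)
```

Here the only entry in the dropped list is the scope note of the non-descriptor "grinding mill".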

The simplest workaround for this problem is to invent a new RDF property that can be used to associate an annotation with a specific alternative label. For example, if we call this invented property skos:annotatesLabel, then the above SKOS/RDF representation could be modified as follows:

@prefix v: <http://www.example.com/vocab#>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.

v:C001 rdf:type skos:Concept;
  skos:prefLabel "grinding house"@en;
  skos:altLabel "grinding mill"@en;
  skos:scopeNote "A place where material is crushed."@en;
  skos:scopeNote _:aaa.

v:C002 rdf:type skos:Concept;
  skos:prefLabel "grindery"@en;
  skos:altLabel "grinding mill"@en;
  skos:scopeNote "A place where metal objects are sharpened."@en;
  skos:scopeNote _:aaa.

_:aaa 
  rdf:value "Use 'grinding house' for a place material is crushed and 'grindery' for a place where metal objects are sharpened."@en;
  skos:annotatesLabel "grinding mill"@en. 

This new representation is slightly clumsy, because the scope note is referenced from two conceptual resources; however, it does allow the original thesaurus-style representation to be recovered.
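
As a rough illustration of that recovery, the following Python sketch rebuilds the entry for a non-descriptor from a list of triples. It treats triples as plain string tuples and does not distinguish literals from blank nodes, which is a simplification. The property skos:annotatesLabel is the invented property discussed above, not part of SKOS Core.

```python
NOTE = ("Use 'grinding house' for a place material is crushed and 'grindery' "
        "for a place where metal objects are sharpened.")

# The workaround graph above, as (subject, predicate, object) string tuples.
TRIPLES = [
    ("v:C001", "skos:prefLabel", "grinding house"),
    ("v:C001", "skos:altLabel", "grinding mill"),
    ("v:C001", "skos:scopeNote", "A place where material is crushed."),
    ("v:C001", "skos:scopeNote", "_:aaa"),
    ("v:C002", "skos:prefLabel", "grindery"),
    ("v:C002", "skos:altLabel", "grinding mill"),
    ("v:C002", "skos:scopeNote", "A place where metal objects are sharpened."),
    ("v:C002", "skos:scopeNote", "_:aaa"),
    ("_:aaa", "rdf:value", NOTE),
    ("_:aaa", "skos:annotatesLabel", "grinding mill"),
]


def non_descriptor_entry(label, triples):
    """Rebuild the thesaurus entry for one non-descriptor label."""
    # Concepts that carry the label as an alternative label.
    concepts = {s for s, p, o in triples if p == "skos:altLabel" and o == label}
    descriptors = [o for s, p, o in triples
                   if p == "skos:prefLabel" and s in concepts]
    # Annotation nodes that point at the label via the invented property.
    notes = {s for s, p, o in triples
             if p == "skos:annotatesLabel" and o == label}
    texts = [o for s, p, o in triples if p == "rdf:value" and s in notes]
    # Assumes shared labels always denote alternatives ("OR").
    entry = [label, "  USE " + " OR ".join(descriptors)]
    entry.extend("  SN " + text for text in texts)
    return entry


print("\n".join(non_descriptor_entry("grinding mill", TRIPLES)))
```

The sketch joins the descriptors with "OR". That assumption is only safe if the thesaurus contains no instructions for coordination, an issue discussed later in this appendix.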

A variant of this workaround involves also introducing a number of new RDFS classes to express the type of an annotation, so the function of an annotation does not have to be expressed by its relationship to a conceptual resource. For example, if SKOS Core included a skos:ScopeNote class, the above RDF representation could be modified as follows:

@prefix v: <http://www.example.com/vocab#>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.

v:C001 rdf:type skos:Concept;
  skos:prefLabel "grinding house"@en;
  skos:altLabel "grinding mill"@en.

v:C002 rdf:type skos:Concept;
  skos:prefLabel "grindery"@en;
  skos:altLabel "grinding mill"@en.

[] rdf:type skos:ScopeNote;
  rdf:value "Use 'grinding house' for a place material is crushed and 'grindery' for a place where metal objects are sharpened."@en;
  skos:annotatesLabel "grinding mill"@en. 

[] rdf:type skos:ScopeNote;
  rdf:value "A place where material is crushed."@en; 
  skos:annotatesConcept v:C001. 

[] rdf:type skos:ScopeNote;
  rdf:value "A place where metal objects are sharpened."@en; 
  skos:annotatesConcept v:C002. 


This type of representation also allows for an extra dimension of flexibility, because different properties could be used to give the plain text, XHTML, MathML or other value of the annotation, e.g. skos:annotationText and skos:annotationXHTML in place of the use of rdf:value as shown above. However, this pattern introduces an additional level of complexity, because an annotation then requires at least three RDF statements whereas previously it required only one.
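
The difference in statement count can be seen in a small Python sketch that builds both forms as lists of triples. The names skos:ScopeNote and skos:annotatesConcept are the invented names used in the discussion above, and the blank node identifier is arbitrary.

```python
def direct_note(concept, text):
    """Scope note attached directly to the concept: one statement."""
    return [(concept, "skos:scopeNote", text)]


def typed_note(concept, text, node="_:n1"):
    """Scope note as a typed annotation resource: three statements.

    skos:ScopeNote and skos:annotatesConcept are the invented names used
    in the discussion above. They are not part of SKOS Core.
    """
    return [
        (node, "rdf:type", "skos:ScopeNote"),
        (node, "rdf:value", text),
        (node, "skos:annotatesConcept", concept),
    ]


text = "A place where material is crushed."
print(len(direct_note("v:C001", text)), len(typed_note("v:C001", text)))
```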

Alternative Descriptors

The example in the section above also illustrates another representation issue. In that example, a non-descriptor is associated with two alternative descriptors, whereas usually a non-descriptor may be associated with only a single descriptor. The instruction is given as "USE x OR y", and the reverse instruction is given as "UFO" instead of the usual "UF".

By allowing two conceptual resources to share the same alternative lexical label in the RDF/SKOS representation, as shown in the above example, the original thesaurus-style representation may be recovered. However, this situation also interacts with the following issue, and the representation used can lead to ambiguity.

Instructions for Coordination

Some thesauri contain non-descriptors that are associated with two or more descriptors, where the intention is that the descriptors should be used in combination, i.e. they should be coordinated by the searcher. The instruction is given as "USE x + y" or "USE x AND y", and the reverse instruction is given as "UF+" instead of the usual "UF".

For example, consider the following extract from a thesaurus:

coal mining USE coal + mining

coal UF+ coal mining

mining UF+ coal mining

If we use the following RDF/SKOS representation:

@prefix v: <http://www.example.com/vocab#>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.

v:D001 rdf:type skos:Concept;
  skos:prefLabel "coal"@en;
  skos:altLabel "coal mining"@en.

v:D002 rdf:type skos:Concept;
  skos:prefLabel "mining"@en;
  skos:altLabel "coal mining"@en.

... it is not possible to recover the original representation, because there is an ambiguity with the representation of alternative descriptors as described above. Appendix H. Semantic Coordination proposes a possible solution to this problem.
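
The ambiguity can be demonstrated with a minimal Python sketch. It maps entries of the hypothetical form (non-descriptor, operator, descriptors) to alternative labels in the manner shown above, and compares the result for a coordination instruction with the result for an otherwise identical "OR" instruction, which is invented here purely for comparison.

```python
def to_alt_labels(entries):
    """Map (non_descriptor, operator, descriptors) entries to alt labels.

    operator is "OR" for alternatives or "+" for coordination. It is
    deliberately ignored, because the representation above has nowhere
    to record it.
    """
    alt_labels = {}
    for non_descriptor, _operator, descriptors in entries:
        for descriptor in descriptors:
            alt_labels.setdefault(descriptor, set()).add(non_descriptor)
    return alt_labels


coordinated = [("coal mining", "+", ["coal", "mining"])]
# Hypothetical variant of the same entry, using alternatives instead.
alternatives = [("coal mining", "OR", ["coal", "mining"])]

print(to_alt_labels(coordinated) == to_alt_labels(alternatives))
```

Both inputs produce the same set of labels, so nothing in the output indicates which instruction was originally given.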

Change Management

A significant issue in the RDF/SKOS representation of thesauri involves the problem of change management. This issue is perhaps more serious than those described above, because whereas the above issues only affect the ability to produce certain types of visualisation (in this case a traditional thesaurus-style representation), the problem of change management impacts directly on the behaviour of information retrieval applications.

Thesauri are generally constructed and managed as a collection of "terms". A "term" is usually given a "candidate" status for a period of time before it is fully accepted into the thesaurus proper. A "term" that starts out as a descriptor may become a non-descriptor, and vice versa. The usual method of deprecating a descriptor is to make it a non-descriptor, and assert a "USE" link to the new descriptor replacing it. The only method of deprecating a non-descriptor is to remove it from the thesaurus.

The difficulty is that no correlation is made between the changes made to the "terms", and the changes in meaning made to the underlying conceptual units of the vocabulary. For example, consider the thesaurus extract:

love UF affection

affection USE love

Here, love is a descriptor and affection is a non-descriptor. A change is made such that these are swapped for each other, and the thesaurus becomes:

affection UF love

love USE affection    

This change is quite reasonable, and may not be intended to change the meaning of the underlying conceptual unit at all.

In the SKOS/RDF representation of this thesaurus, a change could simply be made to the labels of a single conceptual resource, i.e. the following:

@prefix v: <http://www.example.com/vocab#>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.

v:A001 rdf:type skos:Concept;
  skos:prefLabel "love"@en;
  skos:altLabel "affection"@en.

... could simply become:

@prefix v: <http://www.example.com/vocab#>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.

v:A001 rdf:type skos:Concept;
  skos:prefLabel "affection"@en;
  skos:altLabel "love"@en.

This seems straightforward. However, consider the following situation. A thesaurus begins as:

emotion
  UF love
  UF affection

love USE emotion

affection USE emotion 

... and is then changed to:

emotion
  NT love
  NT affection
 
love BT emotion

affection BT emotion
        

Effectively, two new conceptual units have been established in the thesaurus by this change. Therefore the original RDF/SKOS representation:

@prefix v: <http://www.example.com/vocab#>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.

v:A007 rdf:type skos:Concept;
  skos:prefLabel "emotion"@en;
  skos:altLabel "love"@en;
  skos:altLabel "affection"@en.

... becomes:

@prefix v: <http://www.example.com/vocab#>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.

v:A007 rdf:type skos:Concept;
  skos:prefLabel "emotion"@en;
  skos:narrower v:A008;
  skos:narrower v:A009.

v:A008 rdf:type skos:Concept;
  skos:prefLabel "love"@en;
  skos:broader v:A007.

v:A009 rdf:type skos:Concept;
  skos:prefLabel "affection"@en;
  skos:broader v:A007.

How should information retrieval applications respond to this change?
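
This question is left open here. As a starting point only, an application could at least detect mechanically which kind of change has occurred between two versions of the RDF/SKOS representation. The following Python sketch assumes a hypothetical layout in which each version is a dictionary keyed by concept URI. The three change categories are likewise an assumption made for illustration.

```python
def diff_vocab(old, new):
    """Compare two vocabulary versions keyed by concept URI.

    Each value is a dict with "pref" (str) and "alt" (set of str). Both the
    layout and the three change categories are assumptions for illustration.
    """
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    relabelled = sorted(uri for uri in set(old) & set(new)
                        if old[uri] != new[uri])
    return {"added": added, "removed": removed, "relabelled": relabelled}


old = {"v:A007": {"pref": "emotion", "alt": {"love", "affection"}}}
new = {
    "v:A007": {"pref": "emotion", "alt": set()},
    "v:A008": {"pref": "love", "alt": set()},
    "v:A009": {"pref": "affection", "alt": set()},
}
print(diff_vocab(old, new))
```

For the emotion example the sketch reports two added concepts and one concept whose labels changed. In the earlier love/affection swap it would report only a label change on v:A001.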

Consider a final example. A thesaurus begins as:

Countries of the European Union
  SN The 15 member states.

... and is changed to:

Countries of the European Union
  SN The 25 member states.

Because the meaning of the underlying conceptual unit has changed significantly, a new conceptual resource must be identified in order to allow information retrieval applications to respond in some way. That is, the original SKOS/RDF representation:

@prefix v: <http://www.example.com/vocab#>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.

v:F001 rdf:type skos:Concept;
  skos:prefLabel "Countries of the European Union"@en;
  skos:scopeNote "The 15 member states."@en.

... is replaced with:

@prefix v: <http://www.example.com/vocab#>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.

v:F002 rdf:type skos:Concept;
  skos:prefLabel "Countries of the European Union"@en;
  skos:scopeNote "The 25 member states."@en.

To effect an appropriate change in information retrieval behaviour, the original conceptual resource v:F001 is removed from the vocabulary, or indicated to be deprecated in some way (this is not currently supported in SKOS), and a new conceptual resource v:F002 is declared. Additionally, a mapping graph { v:F002 skos:narrower v:F001. } could be created and used by retrieval applications, in conjunction with a query expansion algorithm such as that described in Appendix D. Example Query Expansion Algorithm, to adjust retrieval behaviour accordingly.
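
The effect of such a mapping graph can be sketched in Python. The expansion below is a simple transitive walk over skos:narrower links. It is a stand-in chosen for illustration and is not the algorithm of Appendix D, which is not reproduced here.

```python
def expand(concept, narrower):
    """Return the concept plus everything reachable via narrower links.

    narrower maps a concept URI to a list of directly narrower concepts.
    This is a simple stand-in for a query expansion algorithm.
    """
    seen = {concept}
    stack = [concept]
    while stack:
        current = stack.pop()
        for child in narrower.get(current, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


# The mapping graph { v:F002 skos:narrower v:F001. }
mapping = {"v:F002": ["v:F001"]}

print(sorted(expand("v:F002", mapping)))
print(sorted(expand("v:F001", mapping)))
```

Under this sketch, a query on the new resource v:F002 also retrieves material indexed with the old resource v:F001, whereas a query on v:F001 is not expanded to v:F002.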

The point of this discussion has been to illustrate the difference between the way in which change in a thesaurus is traditionally managed, and the ways in which a corresponding SKOS/RDF representation changes. A thesaurus is typically managed as a collection of "terms", whereas a SKOS/RDF representation of a vocabulary is best managed as a collection of conceptual units. By managing a controlled structured vocabulary as a set of conceptual units, specific and controlled changes to the behaviour of information retrieval applications may be effected, as illustrated above, and this is at best difficult with the traditional management style and corresponding representation frameworks. However, if SKOS is designed to require a style of management that is significantly different from the management model embedded in standards such as ISO 2788 and BS 8723-2, this is likely to significantly retard its own acceptance as a standard.

Node Labels

Some thesauri employ a third type of entity in addition to descriptors and non-descriptors: "node labels". A "node label" is always associated with a group of descriptors, known in BS 8723-2 as an "array". The term "array" can be misleading because the order of the members of a group is not necessarily significant. The purpose of "arrays" and "node labels" is to provide additional organisational structure to the thesaurus, which in turn can aid indexers and searchers in locating appropriate descriptors.

For example, the following is an extract from a thesaurus that uses node labels and arrays:

people
  <people by age>
    infants
    children
    adolescents
    adults
    elderly people
  <people by employment status>
    employed people
    unemployed people

In a traditional thesaurus-style representation, node labels are usually shown enclosed by angle brackets. In the example above, the order of the members of the first array is significant (increasing age), but the order of the members of the second array is not.

Note that BS 8723-2 specifies that an array should only ever be used to introduce a characteristic of division. E.g. in the above, age is the characteristic of division in the first array, and employment status is the characteristic of division in the second. However, some thesauri do not adhere to this recommendation, and use arrays and node labels as general purpose grouping constructs. For example, the AAT [@@TODO ref] employs the notion of a "guide term", which corresponds to an extended notion of a node label. E.g. from the AAT:

Sound devices
  <sound devices by acoustical characteristics>
    aerophones
    chordophones
    electrophones
    ...
  <sound devices by function>
    <ambient sound makers>
      ...
    loudspeakers
    metronomes
    ...
  <sound modifying devices>
    kazoos
    megaphones

In the above example, some "guide terms" are consistent with the BS 8723-2 notion of a "node label" in that they introduce a characteristic of division (e.g. "sound devices by acoustical characteristics" and "sound devices by function") and some are not (e.g. "sound modifying devices").

The above example also illustrates the nesting of arrays, which is found in some thesauri.

The AAT notion of a "guide term" and the BS 8723-2 notion of a "node label" do however share the following assumption, which is that a "node label"/"guide term" should not be used in indexing or in searching. Node labels are purely a navigational aid, providing a means to organise a long list of sibling descriptors in a systematic way.

SKOS Core currently supports the RDF representation of these structures via the skos:Collection, skos:OrderedCollection and skos:CollectableProperty classes and the skos:member and skos:memberList properties. However, there is a serious contradiction in the recommended usage given in the SKOS Core Guide. The intention behind creating the class skos:Collection was that this class would be disjoint with the class skos:Concept, although this has not been formally declared. That is, a "concept" and a "grouping of concepts" are fundamentally different things. However, the usage of semantic relation properties such as skos:narrower to link a skos:Concept to a skos:Collection, as illustrated in the SKOS Core Guide, leads to the inference under RDFS entailment that a resource is both a skos:Concept and a skos:Collection, which is inconsistent with the intended disjointness between these classes.
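
The inference can be reproduced with a minimal Python sketch of the RDFS domain and range rules. It assumes that skos:narrower has domain and range skos:Concept (in SKOS Core this would follow from its super-property skos:semanticRelation) and that skos:member has domain skos:Collection. These schema statements, and the resource names v:people, v:infants and _:byAge, are assumptions made for the example.

```python
# Assumed schema statements (see the lead-in above).
DOMAIN = {"skos:narrower": "skos:Concept", "skos:member": "skos:Collection"}
RANGE = {"skos:narrower": "skos:Concept"}


def infer_types(triples):
    """Collect asserted types plus those entailed by domain and range."""
    types = {}
    for s, p, o in triples:
        if p == "rdf:type":
            types.setdefault(s, set()).add(o)
        if p in DOMAIN:
            # Domain rule: the subject has the domain class as a type.
            types.setdefault(s, set()).add(DOMAIN[p])
        if p in RANGE:
            # Range rule: the object has the range class as a type.
            types.setdefault(o, set()).add(RANGE[p])
    return types


triples = [
    ("v:people", "rdf:type", "skos:Concept"),
    ("v:people", "skos:narrower", "_:byAge"),
    ("_:byAge", "rdf:type", "skos:Collection"),
    ("_:byAge", "skos:member", "v:infants"),
]
print(sorted(infer_types(triples)["_:byAge"]))
```

Under these assumptions the collection node is inferred to be both a skos:Collection and a skos:Concept, which is the contradiction described above.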

Note that, because a "node label" is never involved in indexing or in user queries, this structure is essentially irrelevant for information retrieval applications. It is used only to generate representations of the thesaurus for human consumption. An application attempting to compute a query expansion (e.g. via the algorithm given in Appendix D. Example Query Expansion Algorithm) needs only the graph of direct semantic links between resources of type skos:Concept.

@@TODO workaround ...


--- Change Log ---

$Log: thesaurus.html,v $
Revision 1.5  2006/07/31 12:50:49  ajm65
Minor spelling.

Revision 1.4  2006/07/31 12:41:19  ajm65
Added deprecation note.

Revision 1.3  2006/05/03 12:46:37  ajm65
Added discussion of node label issue, workarounds still todo.

Revision 1.2  2006/05/02 15:16:35  ajm65
First completed draft.

Revision 1.1  2006/04/29 14:25:10  ajm65
Initial.


---