$Revision: 1.5 $ on $Date: 2006/07/31 12:50:49 $ by Alistair Miles.
All parts and revisions of this document can be obtained from http://isegserv.itd.rl.ac.uk/cvs-public/skos/press/dc2006/.
Please note that this document has been superseded, and is maintained for historical purposes only. The final (current) version of this paper is available from http://isegserv.itd.rl.ac.uk/public/skos/press/dc2006/camera-ready-paper.pdf (to be presented at the 2006 International Conference on Dublin Core and Metadata Applications).
This appendix describes specific issues relating to the representation of thesauri in RDF using the SKOS Core Vocabulary 2nd Public Working Draft edition [SKOSCoreGuide, SKOSCoreSpec]. Only "standard" thesauri are considered here, where "standard" means a thesaurus constructed and maintained in a manner that is consistent with ISO 2788:1986 [ISO2788] and/or BS 8723-2 [BS8723-2]. Because these standards have been subject to ambiguous interpretation, the analysis here is based on the model described in the following two paragraphs.
A "standard thesaurus" is constructed and managed as a set of "terms", where a "term" can be either "preferred" or "non-preferred". A "preferred term" is also known as a "descriptor", and a "non-preferred term" is also known as a "non-descriptor". The set of descriptors is also known as the "indexing vocabulary", because in a traditional card index only the descriptors may be used in the value of the "subject" field. The set of non-descriptors is also known as the "entry vocabulary", because each non-descriptor is linked to a descriptor, and therefore the non-descriptors help indexers and searchers find their way to the appropriate descriptor to use in indexing and searching.
The relationship between a non-descriptor and its corresponding descriptor is traditionally denoted by the imperative "USE" (sometimes abbreviated as "US"), and the inverse relationship denoted by "USED FOR" (abbreviated as "UF"). The descriptors may be organised into a generalisation hierarchy, and the generalisation and specialisation relationships are usually denoted by "broader" and "narrower" respectively (abbreviated as "BT" and "NT"). Associative links may also be made between descriptors, and these are denoted by "related" (abbreviated as "RT").
Below is an extract from a standard thesaurus, rendered in the traditional thesaurus style, showing a single descriptor and a single non-descriptor.
love
  UF affection
affection
  USE love
Each descriptor in a thesaurus establishes a distinct, unique meaning.
Therefore, when creating an RDF/SKOS representation of a thesaurus, a
resource of type skos:Concept is declared for each descriptor in
the thesaurus. Each such resource should be allocated a URI. The actual
lexical value of the descriptor is mapped to a value of the
skos:prefLabel property in the appropriate language. The lexical
value of any associated non-descriptors is mapped to values of the
skos:altLabel and/or skos:hiddenLabel properties in
the appropriate language.
For example, the extract above is represented in RDF/SKOS as follows (all RDF examples in this document are given in the Turtle serialisation syntax):
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix v: <http://www.example.com/vocab#>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
v:A001 rdf:type skos:Concept;
  skos:prefLabel "love"@en;
  skos:altLabel "affection"@en.
N.B. In the example above a URI has been constructed by appending an arbitrarily chosen alphanumeric string ("A001" - the "local name") to a base URI ("http://www.example.com/vocab#" - the "namespace URI").
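The mapping described above can be read in the reverse direction to regenerate the traditional display. The following is a minimal sketch only, assuming the RDF graph has already been loaded into a plain Python dictionary; the dictionary layout and the function name `thesaurus_entries` are hypothetical and are not part of SKOS Core.

```python
# Hypothetical in-memory stand-in for the RDF graph above: one record per
# skos:Concept, keyed by local name. This layout is an assumption.
concepts = {
    "A001": {"prefLabel": "love", "altLabels": ["affection"]},
}


def thesaurus_entries(concepts):
    """Return sorted (term, relation, target) rows in USE/UF style."""
    rows = []
    for record in concepts.values():
        descriptor = record["prefLabel"]
        for non_descriptor in record["altLabels"]:
            # Each altLabel yields a UF line under the descriptor and a
            # USE line under the non-descriptor.
            rows.append((descriptor, "UF", non_descriptor))
            rows.append((non_descriptor, "USE", descriptor))
    return sorted(rows)


entries = thesaurus_entries(concepts)
for term, relation, target in entries:
    print(f"{term} {relation} {target}")
```

For this simple case the round trip loses nothing: every USE/UF pair corresponds to exactly one alternative label on exactly one concept.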
Whether or not the issues described below must be resolved during the standardization of SKOS depends on whether the ability to recover a traditional thesaurus-style representation from an RDF/SKOS representation is chosen to be a mandatory requirement.
In some thesauri, annotations (i.e. documentation, notes) of various types may be associated with both descriptors and non-descriptors. Annotations that are associated with a descriptor may be mapped to RDF statements about the conceptual resource, for example the following thesaurus extract:
love
  UF affection
  SN A strong feeling of affection or attraction towards another person.
... can be represented in RDF/SKOS as:
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix v: <http://www.example.com/vocab#>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
v:A001 rdf:type skos:Concept;
  skos:prefLabel "love"@en;
  skos:altLabel "affection"@en;
  skos:scopeNote "A strong feeling of affection or attraction towards another person."@en.
However, there is currently no support in the SKOS Core Vocabulary for annotations that are associated with a non-descriptor. For example, consider the following thesaurus extract:
grinding house
  UFO grinding mill
  SN A place where material is crushed.
grindery
  UFO grinding mill
  SN A place where metal objects are sharpened.
grinding mill
  USE grinding house OR grindery
  SN Use "grinding house" for a place material is crushed and "grindery" for a place
     where metal objects are sharpened.
It is reasonable to create the following SKOS/RDF representation:
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix v: <http://www.example.com/vocab#>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
v:C001 rdf:type skos:Concept;
  skos:prefLabel "grinding house"@en;
  skos:altLabel "grinding mill"@en;
  skos:scopeNote "A place where material is crushed."@en.
v:C002 rdf:type skos:Concept;
  skos:prefLabel "grindery"@en;
  skos:altLabel "grinding mill"@en;
  skos:scopeNote "A place where metal objects are sharpened."@en.
... however the scope note associated with the non-descriptor
grinding mill has been lost, and therefore the original
thesaurus-style representation cannot be recovered from the RDF/SKOS
representation.
The simplest workaround for this problem is to invent a new RDF property
that can be used to associate an annotation with a specific alternative
label. I.e. if we call this invented property
skos:annotatesLabel, then the above SKOS/RDF representation
could be modified as follows:
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix v: <http://www.example.com/vocab#>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
v:C001 rdf:type skos:Concept;
  skos:prefLabel "grinding house"@en;
  skos:altLabel "grinding mill"@en;
  skos:scopeNote "A place where material is crushed."@en;
  skos:scopeNote _:aaa.
v:C002 rdf:type skos:Concept;
  skos:prefLabel "grindery"@en;
  skos:altLabel "grinding mill"@en;
  skos:scopeNote "A place where metal objects are sharpened."@en;
  skos:scopeNote _:aaa.
_:aaa
  rdf:value "Use 'grinding house' for a place material is crushed and 'grindery' for a place where metal objects are sharpened."@en;
  skos:annotatesLabel "grinding mill"@en.
This new representation is slightly clumsy, because the scope note is referenced from two conceptual resources; however, it does allow the original thesaurus-style representation to be recovered.
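To illustrate how the label-level annotation makes the non-descriptor entry recoverable, the following sketch rebuilds the entry for "grinding mill". It is an illustration under stated assumptions: the dictionaries stand in for the RDF graph, `label_notes` stands in for resources carrying the invented skos:annotatesLabel property, and the function name is hypothetical. Joining the targets with "OR" is itself an assumption, since nothing in the data records which operator was intended.

```python
# Hypothetical stand-in for the concepts in the workaround above.
concepts = {
    "C001": {"prefLabel": "grinding house", "altLabels": ["grinding mill"]},
    "C002": {"prefLabel": "grindery", "altLabels": ["grinding mill"]},
}

# Hypothetical stand-in for annotations that use the invented
# skos:annotatesLabel property: label -> note text.
label_notes = {
    "grinding mill": (
        "Use 'grinding house' for a place material is crushed and "
        "'grindery' for a place where metal objects are sharpened."
    ),
}


def non_descriptor_entry(label, concepts, label_notes):
    """Rebuild the display lines for one non-descriptor."""
    # Dictionaries preserve insertion order, so targets appear in the
    # order in which the concepts were declared.
    targets = [
        record["prefLabel"]
        for record in concepts.values()
        if label in record["altLabels"]
    ]
    lines = [label, "  USE " + " OR ".join(targets)]
    if label in label_notes:
        lines.append("  SN " + label_notes[label])
    return lines


entry = non_descriptor_entry("grinding mill", concepts, label_notes)
print("\n".join(entry))
```

Without `label_notes` the third line cannot be produced at all, which is the loss described above.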
A variant of this workaround involves also introducing a number of new
RDFS classes to express the type of an annotation, so the function
of an annotation does not have to be expressed by its relationship to a
conceptual resource. For example, if SKOS Core included a
skos:ScopeNote class, the above RDF representation could be
modified as follows:
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix v: <http://www.example.com/vocab#>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
v:C001 rdf:type skos:Concept;
  skos:prefLabel "grinding house"@en;
  skos:altLabel "grinding mill"@en.
v:C002 rdf:type skos:Concept;
  skos:prefLabel "grindery"@en;
  skos:altLabel "grinding mill"@en.
[] rdf:type skos:ScopeNote;
  rdf:value "Use 'grinding house' for a place material is crushed and 'grindery' for a place where metal objects are sharpened."@en;
  skos:annotatesLabel "grinding mill"@en.
[] rdf:type skos:ScopeNote;
  rdf:value "A place where material is crushed."@en;
  skos:annotatesConcept v:C001.
[] rdf:type skos:ScopeNote;
  rdf:value "A place where metal objects are sharpened."@en;
  skos:annotatesConcept v:C002.
This type of representation also allows for an extra dimension of
flexibility, because different properties could be used to give the plain
text, XHTML, MathML or other value of the annotation, e.g.
skos:annotationText and skos:annotationXHTML in
place of the use of rdf:value as shown above. However, this
pattern introduces an additional level of complexity, because an annotation
then requires at least three RDF statements whereas previously it required
only one.
The example in the section above also illustrates another representation issue. In the above example, a non-descriptor is associated with two alternative descriptors, whereas usually a non-descriptor may only be associated with a single descriptor. The instruction is given as "USE x OR y", and the reverse instruction is given as "UFO" instead of the usual "UF".
By allowing two conceptual resources to share the same alternative lexical label in the RDF/SKOS representation, as shown in the above example, the original thesaurus-style representation may be recovered. However, this situation also interacts with the following issue, and the representation used can lead to ambiguity.
Some thesauri contain non-descriptors that are associated with two or more descriptors, where the intention is that the descriptors should be used in combination, i.e. they should be coordinated by the searcher. The instruction is given as "USE x + y" or "USE x AND y", and the reverse instruction is given as "UF+" instead of the usual "UF".
For example, consider the following extract from a thesaurus:
coal mining
  USE coal + mining
coal
  UF+ coal mining
mining
  UF+ coal mining
If we use the following RDF/SKOS representation:
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix v: <http://www.example.com/vocab#>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
v:D001 rdf:type skos:Concept;
  skos:prefLabel "coal"@en;
  skos:altLabel "coal mining"@en.
v:D002 rdf:type skos:Concept;
  skos:prefLabel "mining"@en;
  skos:altLabel "coal mining"@en.
... it is not possible to recover the original representation, because there is an ambiguity with the representation of alternative descriptors as described above. Appendix H. Semantic Coordination proposes a possible solution to this problem.
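The ambiguity can be made concrete with a small sketch. It assumes a hypothetical source format in which each non-descriptor maps to an operator ("OR" or "+") and a list of descriptors, and it flattens that into (descriptor, alternative label) pairs in the same way as the RDF/SKOS representation above. The "OR" variant of the coal mining entry is invented purely for comparison.

```python
def to_label_pairs(use_rules):
    """Flatten USE rules into (descriptor, altLabel) pairs.

    use_rules maps a non-descriptor to (operator, [descriptors]). The
    operator has no counterpart in the output, mirroring the fact that
    skos:altLabel cannot record it.
    """
    pairs = set()
    for non_descriptor, (_operator, descriptors) in use_rules.items():
        for descriptor in descriptors:
            pairs.add((descriptor, non_descriptor))
    return pairs


# "USE coal + mining", as in the extract above.
coordinated = {"coal mining": ("+", ["coal", "mining"])}
# Hypothetical "USE coal OR mining" variant, for comparison only.
alternative = {"coal mining": ("OR", ["coal", "mining"])}

same = to_label_pairs(coordinated) == to_label_pairs(alternative)
print(same)
```

Because both inputs flatten to the same pairs, no procedure working from the pairs alone can decide whether to print "USE x OR y" or "USE x + y".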
A significant issue in the RDF/SKOS representation of thesauri involves the problem of change management. This issue is perhaps more serious than those described above, because whereas the above issues only affect the ability to produce certain types of visualisation (in this case a traditional thesaurus-style representation), the problem of change management impacts directly on the behaviour of information retrieval applications.
Thesauri are generally constructed and managed as a collection of "terms". A "term" is usually given a "candidate" status for a period of time before it is fully accepted into the thesaurus proper. A "term" that starts out as a descriptor may become a non-descriptor, and vice versa. The usual method of deprecating a descriptor is to make it a non-descriptor, and assert a "USE" link to the new descriptor replacing it. The only method of deprecating a non-descriptor is to remove it from the thesaurus.
The difficulty is that no correlation is made between the changes made to the "terms", and the changes in meaning made to the underlying conceptual units of the vocabulary. For example, consider the thesaurus extract:
love
  UF affection
affection
  USE love
Here, love is a descriptor and affection is a
non-descriptor. A change is made such that these are swapped for each other,
and the thesaurus becomes:
affection
  UF love
love
  USE affection
This change is quite reasonable, and may not be intended to change the meaning of the underlying conceptual unit at all.
In the SKOS/RDF representation of this thesaurus, a change could simply be made to the labels of a single conceptual resource, i.e. the following:
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix v: <http://www.example.com/vocab#>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
v:A001 rdf:type skos:Concept;
  skos:prefLabel "love"@en;
  skos:altLabel "affection"@en.
... could simply become:
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix v: <http://www.example.com/vocab#>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
v:A001 rdf:type skos:Concept;
  skos:prefLabel "affection"@en;
  skos:altLabel "love"@en.
This seems straightforward. However, consider the following situation. A thesaurus begins as:
emotion
  UF love
  UF affection
love
  USE emotion
affection
  USE emotion
... and is then changed to:
emotion
  NT love
  NT affection
love
  BT emotion
affection
  BT emotion
Effectively, two new conceptual units have been established in the thesaurus by this change. Therefore the original RDF/SKOS representation:
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix v: <http://www.example.com/vocab#>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
v:A007 rdf:type skos:Concept;
  skos:prefLabel "emotion"@en;
  skos:altLabel "love"@en;
  skos:altLabel "affection"@en.
... becomes:
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix v: <http://www.example.com/vocab#>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
v:A007 rdf:type skos:Concept;
  skos:prefLabel "emotion"@en;
  skos:narrower v:A008;
  skos:narrower v:A009.
v:A008 rdf:type skos:Concept;
  skos:prefLabel "love"@en;
  skos:broader v:A007.
v:A009 rdf:type skos:Concept;
  skos:prefLabel "affection"@en;
  skos:broader v:A007.
How should information retrieval applications respond to this change?
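One observation is that, unlike the term-level edit, the change is at least detectable at the level of conceptual resources. The following sketch compares two versions of a vocabulary held as plain dictionaries; it does not answer the question of how an application should respond, and the data layout, the function name `diff_vocabulary` and the notion of a "promoted" label are all hypothetical.

```python
def diff_vocabulary(old, new):
    """Summarise concept-level changes between two vocabulary versions."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    # Concepts present in both versions whose preferred label changed.
    relabelled = sorted(
        key
        for key in set(old) & set(new)
        if old[key]["prefLabel"] != new[key]["prefLabel"]
    )
    # Former alternative labels that now name a newly added concept.
    promoted = sorted(
        (label, old_id, new_id)
        for old_id, record in old.items()
        for label in record["altLabels"]
        for new_id in added
        if new[new_id]["prefLabel"] == label
    )
    return {
        "added": added,
        "removed": removed,
        "relabelled": relabelled,
        "promoted": promoted,
    }


old = {"A007": {"prefLabel": "emotion", "altLabels": ["love", "affection"]}}
new = {
    "A007": {"prefLabel": "emotion", "altLabels": []},
    "A008": {"prefLabel": "love", "altLabels": []},
    "A009": {"prefLabel": "affection", "altLabels": []},
}

changes = diff_vocabulary(old, new)
print(changes)
```

Applied to the earlier love/affection swap, the same comparison would report only a relabelled concept and no additions, which distinguishes the two kinds of change.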
Consider a final example. A thesaurus begins as:
Countries of the European Union
  SN The 15 member states.
... and is changed to:
Countries of the European Union
  SN The 25 member states.
Because the meaning of the underlying conceptual unit has changed significantly, in order to allow information retrieval applications to respond in some way a new conceptual resource must be identified. I.e. the original SKOS/RDF representation:
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix v: <http://www.example.com/vocab#>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
v:F001 rdf:type skos:Concept;
  skos:prefLabel "Countries of the European Union"@en;
  skos:scopeNote "The 15 member states."@en.
... is replaced with:
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix v: <http://www.example.com/vocab#>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
v:F002 rdf:type skos:Concept;
  skos:prefLabel "Countries of the European Union"@en;
  skos:scopeNote "The 25 member states."@en.
To effect an appropriate change in information retrieval behaviour, the
original conceptual resource v:F001 is removed from the
vocabulary, or indicated to be deprecated in some way (this is not currently
supported in SKOS), and a new conceptual resource v:F002 is
declared. Additionally, a mapping graph could be created { v:F002
skos:narrower v:F001. } and used by retrieval applications, in
conjunction with a query expansion algorithm such as described by Appendix D. Example Query Expansion Algorithm, to
adjust retrieval behaviour accordingly.
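As a rough illustration of how such a mapping graph might be used, the sketch below expands a query concept by following narrower links transitively. This is a deliberately simple stand-in and is not the algorithm of Appendix D; the dictionary form of the mapping graph and the function name `expand` are assumptions made for the example.

```python
def expand(concept, narrower):
    """Return the concept plus everything reachable via narrower links."""
    seen = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guards against cycles in the graph
        seen.add(current)
        stack.extend(narrower.get(current, ()))
    return seen


# The mapping graph { v:F002 skos:narrower v:F001. } as a dictionary.
mapping = {"v:F002": ["v:F001"]}

new_query = expand("v:F002", mapping)
old_query = expand("v:F001", mapping)
print(sorted(new_query), sorted(old_query))
```

Under this sketch a search on the new concept also retrieves material indexed with the deprecated one, while a search on the old concept is not widened to the 25 member states.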
The point of this discussion has been to illustrate the difference between the way in which change in a thesaurus is traditionally managed, and the ways in which a corresponding SKOS/RDF representation changes. A thesaurus is typically managed as a collection of "terms", whereas a SKOS/RDF representation of a vocabulary is best managed as a collection of conceptual units. By managing a controlled structured vocabulary as a set of conceptual units, specific and controlled changes to the behaviour of information retrieval applications may be effected, as illustrated above, and this is at best difficult with the traditional management style and corresponding representation frameworks. However, if SKOS is designed to require a style of management that is significantly different from the management model embedded in standards such as ISO 2788 and BS 8723-2, this is likely to significantly retard its own acceptance as a standard.
Some thesauri employ a third type of entity in addition to descriptors and non-descriptors: "node labels". A "node label" is always associated with a group of descriptors, known in BS 8723-2 as an "array". The term "array" can be misleading because the order of the members of a group is not necessarily significant. The purpose of "arrays" and "node labels" is to provide additional organisational structure to the thesaurus, which in turn can aid indexers and searchers in locating appropriate descriptors.
For example, the following is an extract from a thesaurus that uses node labels and arrays:
people
  <people by age>
    infants
    children
    adolescents
    adults
    elderly people
  <people by employment status>
    employed people
    unemployed people
In a traditional thesaurus-style representation, node labels are usually shown enclosed by angle brackets. In the example above, the order of the members of the first array is significant (increasing age), but the order of the members of the second array is not.
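The distinction between ordered and unordered arrays matters when a display is generated. The following sketch renders the example above from a hypothetical list of (node label, members, ordered) tuples: ordered arrays keep their given sequence, while unordered arrays are sorted alphabetically, which is one possible display choice among several. The tuple layout and the function name `render` are assumptions.

```python
def render(descriptor, arrays, indent="  "):
    """Render a descriptor and its arrays in the traditional style."""
    lines = [descriptor]
    for node_label, members, ordered in arrays:
        # Node labels are conventionally enclosed in angle brackets.
        lines.append(f"{indent}<{node_label}>")
        shown = members if ordered else sorted(members)
        lines.extend(f"{indent * 2}{member}" for member in shown)
    return lines


arrays = [
    (
        "people by age",
        ["infants", "children", "adolescents", "adults", "elderly people"],
        True,  # order is significant (increasing age)
    ),
    (
        "people by employment status",
        ["unemployed people", "employed people"],
        False,  # order is not significant
    ),
]

lines = render("people", arrays)
print("\n".join(lines))
```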
Note that BS 8723-2 specifies that an array should only ever be used to introduce a characteristic of division. E.g. in the above, age is the characteristic of division in the first array, and employment status is the characteristic of division in the second. However, some thesauri do not adhere to this recommendation, and use arrays and node labels as general purpose grouping constructs. For example, the AAT [@@TODO ref] employs the notion of a "guide term", which corresponds to an extended notion of a node label. E.g. from the AAT:
Sound devices
  <sound devices by acoustical characteristics>
    aerophones
    chordophones
    electrophones
    ...
  <sound devices by function>
    <ambient sound makers>
      ...
    loudspeakers
    metronomes
    ...
    <sound modifying devices>
      kazoos
      megaphones
In the above example, some "guide terms" are consistent with the BS 8723-2 notion of a "node label" in that they introduce a characteristic of division (e.g. "sound devices by acoustical characteristics" and "sound devices by function") and some are not (e.g. "sound modifying devices").
The above example also illustrates the nesting of arrays, which is found in some thesauri.
The AAT notion of a "guide term" and the BS 8723-2 notion of a "node label" do however share the following assumption, which is that a "node label"/"guide term" should not be used in indexing or in searching. Node labels are purely a navigational aid, providing a means to organise a long list of sibling descriptors in a systematic way.
SKOS Core currently supports the RDF representation of these structures
via the skos:Collection, skos:OrderedCollection and
skos:CollectableProperty classes and the
skos:member and skos:memberList properties.
However, there is a serious contradiction in the recommended usage given in
the SKOS Core Guide. The intention behind creating the class
skos:Collection was that this class would be disjoint with the
class skos:Concept, although this has not been formally
declared. I.e. a "concept" and a "grouping of concepts" are fundamentally
different things. However, the usage of semantic relation properties such as
skos:narrower to link a skos:Concept to a
skos:Collection, as illustrated in the SKOS Core Guide, leads to
the inference under RDFS entailment that a resource is both a
skos:Concept and a skos:Collection, which is
inconsistent with the intended disjointness between these classes.
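The inference can be reproduced with a small sketch of the RDFS domain and range rules. It assumes, for illustration, that skos:narrower has both domain and range skos:Concept; the resource names v:people and v:byAge are hypothetical, and the triples are held as plain tuples rather than in an RDF store.

```python
RDF_TYPE = "rdf:type"

# Assumed schema statements (rdfs:domain and rdfs:range) for the example.
domains = {"skos:narrower": "skos:Concept"}
ranges = {"skos:narrower": "skos:Concept"}

# A concept linked directly to a collection, following the usage
# pattern discussed above. Resource names are hypothetical.
triples = {
    ("v:people", "skos:narrower", "v:byAge"),
    ("v:byAge", RDF_TYPE, "skos:Collection"),
}


def infer_types(triples, domains, ranges):
    """Collect asserted types plus those entailed by domain and range."""
    types = {}
    for subject, predicate, obj in triples:
        if predicate == RDF_TYPE:
            types.setdefault(subject, set()).add(obj)
        if predicate in domains:
            types.setdefault(subject, set()).add(domains[predicate])
        if predicate in ranges:
            types.setdefault(obj, set()).add(ranges[predicate])
    return types


types = infer_types(triples, domains, ranges)
print(sorted(types["v:byAge"]))
```

The resource v:byAge ends up typed as both skos:Collection and skos:Concept, which is the conflict with the intended disjointness.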
Note that, because a "node label" is never involved in indexing or in user
queries, this structure is essentially irrelevant for information retrieval
applications. It is used only to generate representations of the thesaurus
for human consumption. An application attempting to compute a query expansion
(e.g. via the algorithm given in Appendix D. Example
Query Expansion Algorithm) needs only the graph of direct semantic links
between resources of type skos:Concept.
@@TODO workaround ...
--- Change Log ---
$Log: thesaurus.html,v $
Revision 1.5 2006/07/31 12:50:49 ajm65
Minor spelling.
Revision 1.4 2006/07/31 12:41:19 ajm65
Added deprecation note.
Revision 1.3 2006/05/03 12:46:37 ajm65
Added discussion of node label issue, workarounds still todo.
Revision 1.2 2006/05/02 15:16:35 ajm65
First completed draft.
Revision 1.1 2006/04/29 14:25:10 ajm65
Initial.
---