$Revision: 1.30 $ on $Date: 2006/07/31 12:50:49 $ by Alistair Miles.
All revisions of this document can be obtained from http://isegserv.itd.rl.ac.uk/cvs-public/skos/press/dc2006/paper.html.
Please note that this document has been superseded, and is maintained for historical purposes only. The final (current) version of this paper is available from http://isegserv.itd.rl.ac.uk/public/skos/press/dc2006/camera-ready-paper.pdf (to be presented at the 2006 International Conference on Dublin Core and Metadata Applications).
Abstract
Is SKOS ready to be a W3C Recommendation? What needs to be done to make SKOS a genuinely useful standard?
This paper explores motivational, architectural, social, scientific, practical and theoretical issues that bear directly on the possible development of SKOS towards W3C Recommendation status. What is the scope (and therefore purpose) of SKOS? What key software components is SKOS intended to support, and how do they interact? What models of social collaboration and change management are assumed, and how are they evolving? Does SKOS depend on further original and/or integrative research, and what results are needed? Can SKOS be used alongside DCMI metadata applications, or alongside OWL applications? What major technical issues are unresolved, and what new features are needed?
Although this paper does not offer definitive answers to any of these questions, it aims to lay a foundation for discussion, in preparation for the third phase of the W3C's Semantic Web Activity.
SKOS (Simple Knowledge Organisation Systems) [SKOSCoreGuide, SKOSCoreSpec] is a framework for representing a controlled structured vocabulary as an RDF graph [RDFConcepts], and for using controlled structured vocabularies in distributed information retrieval applications. Its current status is that of a W3C Working Draft, and it is anticipated that it will be developed towards W3C Recommendation status within the scope of the next (third) phase of the W3C's Semantic Web Activity.
In preparation for the move towards standardization, this paper explores a number of outstanding issues. Key assumptions that have motivated design are articulated, so that they may be subjected to critical evaluation. The goal is to provide a foundation for open discussion leading in to the standardization process, and ultimately to ensure that SKOS can realise its potential and become a genuinely useful and practical tool.
A number of important questions remain unanswered with respect to the development of SKOS towards a W3C Recommendation.
Firstly, exactly what is the fundamental design goal? In other words, what is the scope for this work? The scope of a Recommendation is usually articulated by means of a set of use cases, so what are the motivating use cases for SKOS? These questions are explored in Section 2. Scope.
Secondly, what current/anticipated generic software components is SKOS intended to support, and how do/will those components interact? A standard is irrelevant without implementation, therefore what types of information system are relevant to SKOS, and what trends are anticipated? To what extent should SKOS support current and/or legacy systems? These questions are explored in Section 3. Architecture.
Thirdly, what current/anticipated social processes is SKOS intended to support? How do these social processes depend on software components, on the Internet, and on the World Wide Web? Are relevant current social processes well understood, and are they likely to remain stable, or are they evolving? Specifically, in what ways do people collaborate in the development and use of controlled structured vocabularies, and what models of change management are employed or required? How then can SKOS support adaptation in information systems where controlled vocabularies are evolving? These issues are explored in Section 4. Society.
Fourthly, to what extent does the standardization of SKOS depend on further original and/or integrative research work? For example, if SKOS is to support information retrieval applications, are the logical and computational properties of the required functionalities well understood? If not, which areas require further work? These questions are explored in Section 5. Research.
Fifthly, is the coupling of SKOS and the DCMI metadata standards well understood? SKOS is a natural partner to the DCMI metadata terms [DCMITerms], providing a simple way of expressing structure in the value space of the "subject" field. However, the DCMI guidelines on encoding metadata in RDF [DCQRDF] are being updated at the time of writing, and the DCMI abstract model [DCAM], although inspired by RDF, does not align exactly with the RDF model and semantics [RDFConcepts, RDFSemantics]. Given that SKOS is itself an application of RDF, how then should SKOS and the DCMI metadata terms be used in conjunction? Are there any areas where alignment may be less than obvious? These issues are explored in Section 6. Partnership.
Finally, Section 7. Features asks: which desirable features, anticipated to be in scope for future SKOS development, remain to be designed and tested? Are there any serious issues with current or proposed features?
This paper does not attempt to provide a complete answer to these questions. Rather, this paper attempts to build a foundation for open discussion, as a prelude to achieving the wider consensus required by a web standard.
The SKOS Core Vocabulary [SKOSCoreGuide, SKOSCoreSpec] was initially developed in the context of workpackage 8 of the Semantic Web Advanced Development for Europe Project [SWADEurope]. This workpackage was focused on the "migration" (a.k.a. "porting") of thesauri and classification schemes to RDF-based representations. SKOS Core was developed initially as a solution to the requirement for an RDF vocabulary that could represent the content and structure of a thesaurus conforming to the ISO 2788:1986 guidelines [SWADEurope8.1, ISO2788]. However, it became apparent during the development of "guidelines for migration" [SWADEurope8.8] that there is inherent variability in the structure of deployed thesauri, and that any thesaurus representation language must take this variability into account if it is to be a genuine aid to interoperability.
The scope of SKOS Core was at this time extended to any thesaurus whose purpose was as an information retrieval tool, given that custom extensions to the SKOS Core Vocabulary would be required to represent "non-standard" features. The assumption was that property and class "extensions" (a.k.a. "refinements") to the SKOS Core Vocabulary could be declared using the RDF Vocabulary Description Language [RDFS] in order to represent "non-standard" features. The RDFS entailment rules could then be used to infer a basic SKOS Core representation of a thesaurus from any representation that used extensions, providing a layered approach to extensibility and an automatic assurance of backwards compatibility.
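The layered approach described above can be illustrated with a minimal sketch. It assumes triples are held as plain Python tuples, and it applies only the two RDFS entailment rules concerning sub-properties (rdfs5 and rdfs7). The extension property ex:broaderPartitive and the example.org namespace are hypothetical and are used purely for illustration.

```python
# Hypothetical namespaces and property names, for illustration only.
SKOS = "http://www.w3.org/2004/02/skos/core#"
EX = "http://example.org/ext#"
RDFS_SUBPROPERTY = "http://www.w3.org/2000/01/rdf-schema#subPropertyOf"


def rdfs_subproperty_closure(triples):
    """Apply RDFS rules rdfs5 and rdfs7 until no new triples appear."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        sub_of = {(s, o) for s, p, o in inferred if p == RDFS_SUBPROPERTY}
        new = set()
        # rdfs5: subPropertyOf is transitive.
        for a, b in sub_of:
            for c, d in sub_of:
                if b == c:
                    new.add((a, RDFS_SUBPROPERTY, d))
        # rdfs7: (s p o) and (p subPropertyOf q) entail (s q o).
        for s, p, o in inferred:
            for a, b in sub_of:
                if p == a:
                    new.add((s, b, o))
        if not new <= inferred:
            inferred |= new
            changed = True
    return inferred


# A thesaurus that uses a custom refinement of skos:broader.
extended_thesaurus = {
    (EX + "broaderPartitive", RDFS_SUBPROPERTY, SKOS + "broader"),
    (EX + "wheel", EX + "broaderPartitive", EX + "car"),
}
basic_view = rdfs_subproperty_closure(extended_thesaurus)
```

An application that understands only the basic vocabulary can then read the inferred skos:broader statement and ignore the extension property.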
A further deliverable explored the RDF representation of classification schemes [SWADEurope8.5], and suggested that SKOS Core could provide a reasonable basis for an RDF representation of at least one classification scheme and its application to document classification. The scope of SKOS Core was then extended to include the representation of thesauri and classification schemes whose purpose is subject-based indexing and retrieval. It was not ruled out that other types of controlled structured vocabulary whose purpose was indexing and retrieval, such as "taxonomies" and subject heading systems, might also be represented, although the suitability of SKOS Core for these purposes was not evaluated by the SWAD-Europe project.
Subsequently, the W3C Glossary Project [W3CGlossary] chose to use the SKOS Core Vocabulary for the RDF representation of the W3C glossaries [W3CGlossaryData]. The example of the W3C glossaries has been used in other publications to illustrate the simplest use of SKOS Core (e.g. [Miles200505]). This usage raised the question of whether SKOS Core would be a suitable RDF representation framework for glossaries in general.
There is a significant difference between glossaries and the family of vocabulary types that includes thesauri and classification schemes, in that the function of a glossary has nothing to do with information retrieval. The basic function of a glossary is to convey the intended meaning(s) of word and/or multi-word term usage within a given scope (e.g. within a particular document or document set). As such, glossaries share something in common with "terminologies". A glossary is also similar to a dictionary, in that it maps a set of words and/or multi-word terms to a set of definitions.
The development and standardization of frameworks for the formal representation of "terminologies" (see e.g. [ISO16642, ISO12620]) has a distinct history, and has proceeded largely in parallel to the development and standardization of representation frameworks for thesauri, classification schemes and subject heading systems (see e.g. [ISO2788, BS8723-2]). This parallelism is probably due to the fact that terminologies have a different purpose from controlled vocabularies whose purpose is information retrieval, and therefore a different set of motivating use cases. A "terminology" is not necessarily a "controlled vocabulary", in the sense that a "terminology" attempts to describe the structure and usage of one or more natural languages, whereas a controlled vocabulary essentially defines its own (controlled) language. Because the purpose of glossaries, dictionaries and terminologies is completely unrelated to information retrieval, the associated set of software tools similarly support a completely different suite of functionalities (see also Section 3. Architecture).
A possible scope for SKOS is the representation of controlled structured vocabularies whose intended use is within information retrieval applications. The set of vocabulary features that establish the basic requirements for SKOS could be defined in relation to a suite of information retrieval functionalities that must be enabled, which are in turn defined by a set of information retrieval use cases. A mechanism for creating extensions to SKOS could be an integral part of its specification, and thereby a certain amount of the variability found in deployed vocabularies may be supported. However, if the fundamental requirement is to satisfy a set of defined information retrieval functionalities, then there would be no requirement for unlimited extensibility, and SKOS would be sufficient even if it were not possible to represent all features of currently deployed retrieval vocabularies without loss of information. SKOS could of course still be used to represent aspects of glossaries, terminologies or dictionaries, however this would not be a requirement.
An example use case that illustrates a typical information retrieval scenario involving the use of controlled structured vocabularies is given in Appendix A. Example Use Case. This use case describes a situation where the functionalities supported by semantically precise indexing and searching must be integrated with functionalities supported by statistical analysis of textual content, and with functionalities supported by active use of a retrieval portal. This is anticipated to be a key feature of information retrieval systems within which controlled structured vocabularies are used.
Should SKOS support the RDF representation of "folksonomies"? Folksonomies such as those established by the del.icio.us [Delicious] and flickr [Flickr] web communities are (unstructured) vocabularies whose purpose is information retrieval (in the first case retrieval of web pages, in the second case retrieval of photos). However, the extent to which they are "controlled" is debatable. They are controlled in the sense that the meaning of a "tag" is established by its use within a loosely defined community; however, there is no central authority attempting to guide usage and therefore determine meaning. Further work is needed to establish the requirements for RDF representation of folksonomies.
What generic software components are required to enable the use of controlled structured vocabularies for information retrieval? We can begin to answer this question by describing a simple workflow: (a) a controlled structured vocabulary is created, (b) an index over a particular collection of resources is created using a particular vocabulary, (c) the index and the vocabulary are used to retrieve resources from the collection. Each of these three tasks requires a functionally distinct software component. Let us call these components A, B and C, corresponding to tasks (a), (b) and (c) respectively.
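As a rough illustration of this workflow, the following sketch models each task as a function over a shared, deliberately simplistic data model. The function names, concept identifiers and resource identifiers are all hypothetical, and the dictionaries stand in for what would in practice be RDF data.

```python
def build_vocabulary():
    """Task (a): component A creates a vocabulary (concept id -> label)."""
    return {"c1": "Mammals", "c2": "Birds"}


def build_index(vocabulary, assignments):
    """Task (b): component B indexes resources, using only known concepts."""
    index = {}
    for resource, concept in assignments:
        if concept not in vocabulary:
            raise ValueError(f"unknown concept: {concept}")
        index.setdefault(resource, set()).add(concept)
    return index


def retrieve(vocabulary, index, label):
    """Task (c): component C resolves a label, then looks up resources."""
    wanted = {c for c, text in vocabulary.items() if text.lower() == label.lower()}
    return sorted(r for r, concepts in index.items() if concepts & wanted)


vocabulary = build_vocabulary()
index = build_index(vocabulary, [("doc1", "c1"), ("doc2", "c2"), ("doc3", "c1")])
results = retrieve(vocabulary, index, "mammals")
```

The three functions share nothing except the shape of the data they exchange. That is the property the component separation relies on.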
A fundamental assumption in the design of SKOS has been that these three components are the key components in a distributed software architecture whose purpose is information management and retrieval. Articulating a functional specification for each of these key components is therefore a prerequisite to understanding the functional requirements for SKOS, because the underlying purpose of SKOS from a systems point of view is to enable these components to interoperate. The main part of this section is devoted to sketching outlines for these component specifications. Note that other components may also be required to complete this architecture, and the latter part of this section explores what components may be missing from the basic picture.
First, however, let us state another fundamental assumption in the design of SKOS: we have assumed that the key components may become interoperable by virtue of the fact that they share a common data model, and nothing more. We have also assumed that these components do not interact directly, but indirectly via the Semantic Web. Appendix B. Interaction Models illustrates this "Semantic Web Interaction Model", and compares it with a "Direct Interaction Model" where components interact directly with each other via programmatic interfaces.
What are the benefits of designing SKOS to support the Semantic Web Interaction Model? Firstly, it is a minimalistic approach to interoperability. By defining a common data model in terms of RDF, and reusing the established data publication and data access protocols of the Semantic Web (i.e. HTTP [HTTP] and SPARQL [SPARQLQuery, SPARQLProtocol]), only the data model itself need be standardised, and may be standardised in a way that is independent from any interaction protocols or query languages. Also, these web protocols are likely to be more stable than specific application protocols, and therefore we gain stability in addition to simplicity.
However, the real advantage of the Semantic Web Interaction Model is that it allows a web of application data to grow in an organic way.
Consider the difference between email and the World Wide Web (hereafter simply "the Web"). Email is an efficient and cost-effective way of passing information from one point to another, but it is not a means of linking information together. That is, with email, information becomes distributed, but it is also dispersed and isolated: it exists as many islands. It does not form a web. When information is distributed in this way, the same information is inevitably duplicated in many places. Managing the connections and dependencies between different pieces of information becomes an extremely complicated task, because both the information itself and the references between information have to be synchronised across many locations. This is a significant barrier to scalability.
When constructed as a web, information need not be duplicated because it has a presence in that web. Links and dependencies between information may be described unambiguously, by virtue of the fact that resources are given a unique identity within the scope of that web, i.e. information becomes referenceable. Responsibilities for the maintenance of information and of dependencies between information may then be clearly defined.
Note that designing SKOS with a Semantic Web Interaction Model in mind does not preclude its use in a Direct Interaction Model architecture. Note also that both a Semantic Web Interaction Model and a Direct Interaction Model are examples of service-oriented architectures: the difference is due to the fact that in the second case there is a dependency between the service interfaces and the data model, whereas in the first case these are entirely independent.
Let us return to sketching specifications for the three key components mentioned above.
Component A is a development environment for controlled structured vocabularies. It allows one or more people to collaboratively create and edit a vocabulary. The collaboration of many geographically distributed editors and contributors must be enabled, and therefore future implementations are likely to be Web applications. Additionally, this component should support a continuum of collaboration models, from strict editorial control to anarchy of the masses (see also Section 4. Society). As a corollary to this, appropriate change management procedures should be built in to the component's workflow. Current implementations tend to specialise in a particular variety of controlled structured vocabulary (e.g. thesauri conforming to ISO 2788:1986 [ISO2788]); however, a complete implementation will allow the construction of a controlled structured vocabulary according to a custom profile of features from thesauri, classification schemes, subject heading systems and/or taxonomies. Simple multilinguality (e.g. via multilingual labelling, see [SWADEurope8.3]) should be supported in a sensible way. Custom extensions to the basic vocabulary structure (for example, custom semantic relations) should also be supported. Finally, publication of a vocabulary in the Semantic Web must be handled transparently.
Component B is a development environment for creating and maintaining an index over a collection of resources using a controlled structured vocabulary. The index is typically subject-based; however, component B should also support indexes over multiple metadata fields (i.e. a "multi-faceted index"), and should therefore allow the creation and/or import of a custom metadata application profile. As with component A, this component must enable the collaboration of many geographically distributed editors and contributors, and support a continuum of collaboration models. The work of the indexer must be automated as much as possible, by whatever means available (e.g. via statistical comparison of the textual content of indexed items with textual labels and annotations in the controlled vocabulary, and/or via machine learning algorithms operating on the index-so-far as the training set). Current implementations tend to focus on a particular style of index assuming a particular type of controlled vocabulary (e.g. subject classification with a classification scheme); however, a complete implementation will allow construction of an index according to a custom profile of features, e.g. both primary and secondary subject allocations, with support for semantic coordination (see also Appendix D. Example Query Expansion Algorithm and Appendix H. Semantic Coordination). Component B is able to interact with the indexing vocabulary via the Semantic Web, and is also able to handle vocabulary changes in a sensible way. Finally, publication of an index in the Semantic Web must be handled transparently.
Component C is a retrieval tool. It allows a user to interact with one or more indexes over one or more collections using one or more controlled structured vocabularies, and in so doing discover resources of interest. Current implementations of this component tend to encapsulate a single index over a single collection; however, future implementations must be able to take advantage of the ability to harvest and merge Semantic Web data, and thereby encapsulate a combined virtual index over many collections. Component C must be able to calculate relevance metrics in order to improve search precision by effective ranking, and must be able to expand result sets in order to improve search recall (see also Appendix D. Example Query Expansion Algorithm). Both of these functions require exploitation of vocabulary structure and latent index structure (i.e. both paradigmatic relationships between vocabulary units asserted in the vocabulary, and syntagmatic relationships between vocabulary units discovered by an analysis of the available indexes). Where search targets are resources that have textual content, this component must integrate transparently with other components providing search functionality with automatically generated content indexes (see also Appendix A. Example Use Case).
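To make the expansion and ranking functions of component C more concrete, the following is a minimal sketch, and not the algorithm given in Appendix D. It assumes that the vocabulary is available as a mapping from each concept to its narrower concepts, and it uses a deliberately naive relevance score (the inverse of hierarchical distance from the query concept). All identifiers and the scoring function are assumptions made for illustration. Only paradigmatic relationships are exploited here; syntagmatic relationships are ignored.

```python
def expand(concept, narrower):
    """Return the query concept and all narrower concepts, with their depth."""
    depths = {concept: 0}
    frontier = [concept]
    while frontier:
        current = frontier.pop()
        for child in narrower.get(current, ()):
            # Keep the shortest distance; this also guards against cycles.
            if child not in depths or depths[child] > depths[current] + 1:
                depths[child] = depths[current] + 1
                frontier.append(child)
    return depths


def search(concept, narrower, index):
    """Expand the query, then rank resources by closeness to the query concept."""
    depths = expand(concept, narrower)
    scored = {}
    for resource, concepts in index.items():
        hits = [depths[c] for c in concepts if c in depths]
        if hits:
            scored[resource] = 1.0 / (1 + min(hits))  # naive relevance score
    return sorted(scored.items(), key=lambda item: (-item[1], item[0]))


narrower = {"animals": ["mammals", "birds"], "mammals": ["cats"]}
index = {"doc1": {"cats"}, "doc2": {"animals"}, "doc3": {"plants"}, "doc4": {"birds"}}
ranked = search("animals", narrower, index)
```

Expansion improves recall because resources indexed with narrower concepts are returned. The score is one simple way of ordering the expanded result set so that precision at the top of the list is preserved.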
What other software components, beyond these key components, might be important?
Not mentioned so far is the role of vocabulary mapping. The task of creating a mapping between two controlled structured vocabularies is both cost- and time- intensive, and yet the benefits of being able to seamlessly search across indexes that use different indexing vocabularies are arguably significant, especially in a multilingual context. If a vocabulary mapping is already available and published in the Semantic Web, then we can extend the specification of component C to require that it transparently handle search across heterogeneous indexes, which does not necessarily require query translation if the appropriate expansion functionality has been implemented (see also Appendix D. Example Query Expansion Algorithm and Appendix I. Semantic Mapping).
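The idea that a published mapping allows search across heterogeneous indexes without query translation can be sketched as follows. The sketch assumes the mapping is a simple one-directional dictionary from a concept in one vocabulary to equivalent concepts in another. The identifiers are hypothetical, and real mappings are likely to carry richer relations than plain equivalence.

```python
def search_across(concept, mapping, indexes):
    """Expand a query concept with its mapped equivalents, then search every index."""
    wanted = {concept} | set(mapping.get(concept, ()))
    hits = set()
    for index in indexes:
        for resource, concepts in index.items():
            if concepts & wanted:
                hits.add(resource)
    return sorted(hits)


# Hypothetical mapping from a thesaurus concept to a classification scheme concept.
mapping = {"thesA:cats": ["schemeB:felines"]}
index_a = {"a1": {"thesA:cats"}, "a2": {"thesA:dogs"}}   # indexed with vocabulary A
index_b = {"b1": {"schemeB:felines"}}                     # indexed with vocabulary B
found = search_across("thesA:cats", mapping, [index_a, index_b])
```

The query is never rewritten into the second vocabulary. The mapped concept simply joins the expanded set of concepts against which every index is matched.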
In order to create a vocabulary mapping in the first place a tool is required: let us call this component D. Briefly, this component must handle mapping between different types of controlled structured vocabulary (e.g. between a thesaurus and a classification scheme), and must assist the mapper in so far as is possible by whatever means available (e.g. by analytic comparison of textual labels and annotations in source and target vocabularies, and/or by employing ancillary lexical or terminological resources such as WordNet). Crucially, the mapping task must itself be as simple as possible, and achieve the minimum result, which is to support information retrieval across heterogeneous indexes. Finally, component D must transparently handle the publication of a vocabulary mapping in the Semantic Web.
The components A, B, C and D described so far are all forward-facing, in the sense that users interact directly with these components. However, does their implementation require a middleware layer above the Semantic Web service layer, or can these components interact directly with Semantic Web (i.e. HTTP and SPARQL) services? I suggest that there is a role for a layer of services above the Semantic Web. For example, implementing dynamic ranking and expansion functionality directly on top of a SPARQL service is likely to result in significant time delays for searchers, and responsiveness of end-user applications is an extremely important property. The analysis required to achieve ranking and expansion need only be performed once for any aggregated index, and the results of this analysis can therefore be cached. Each search application (component C) could perform this analysis itself. However, if so, multiple search applications may duplicate a similar analysis. Additionally, if each search application must implement potentially complex analysis functionality, this places high demands on search application implementers.
These factors suggest a role for an intermediate service between the vocabularies and indexes published in the Semantic Web, and a searching application that the user interacts with: call this component E. Briefly, this component aggregates one or more indexes from the Semantic Web, and performs an analysis over these indexes and over the vocabularies used. This component then uses these analyses to evaluate queries and return ranked and/or expanded result sets as requested. A simple service interface definition is therefore required for this component.
Will software vendors support such a functional separation between the key components described above, or will they resist this by favouring end-to-end solutions that lock customers in? For example, information management in many libraries is supported by a single software system, which may not support open data models. However, a functional separation does enable different vendors to target solutions to different parts of the workflow, i.e. it allows vendors to specialise and to focus on what they do best. Not only is this a more efficient approach for the vendors in competition with each other, but it allows vendors to compete directly with open-source solutions that are likely to be well factored and built on open standards.
Finally, where do "metadata registries" fit in a Semantic Web architecture? The fundamental purposes of a metadata registry in this architecture are discovery, trust management and quality control. So, for example, a registry could be aware of a number of controlled vocabularies published in the Semantic Web, and could hold descriptive metadata about each vocabulary. A user or an agent unaware of controlled vocabularies in their domain of interest could query a known trusted registry, to discover potentially useful vocabularies. If the registry only holds vocabularies that have passed some sort of quality review, then the user gains a quality assurance by using a vocabulary discovered through the registry. The registry could optionally harvest the content of the vocabularies, and provide a SPARQL service over those data, or could simply indicate the location of the authoritative SPARQL service where the content may be obtained directly.
The previous section began by describing a simple workflow whereby a controlled structured vocabulary is used to provide a retrieval service to end users of a particular collection of items. This process, whereby a controlled vocabulary is created, maintained, and used to establish an index over a collection of items, is costly, primarily because the extent to which it can be automated is very limited. Most of the work must be done by hand, and by people with specific expertise and training. Furthermore, a service based on this process does not provide any real value to the end user until the vocabulary is complete and relatively stable, and the index has been created and quality-controlled, i.e. bootstrapping a service of this kind requires a significant initial investment in terms of both time and money. In addition to this, as end user needs evolve, so the vocabulary and the index must evolve if the service is to continue to be relevant, and thus a significant ongoing cost is involved.
The fundamental questions therefore are: (1) under what circumstances is it "profitable" to provide services using a controlled structured vocabulary (i.e. when does the "value" outweigh the cost), and (2) what strategies can be employed to reduce the initial and ongoing costs?
Before we can begin to answer these questions, we must first consider alternative means by which the ultimate requirements of the end user may be met. The following discussion assumes that the end user has two fundamental requirements: (1) to be able to locate an item in a collection, of which the user already has some prior knowledge, and (2) to be able to discover items in a collection that are relevant to the user's interest, of which the user has no prior knowledge.
If all the items in a collection have some textual content, then rudimentary location and discovery services can be achieved by statistical analysis of this content, which can of course be fully automated. The performance of these services can be improved if the content is semi-structured in a consistent way, for example if the items are documents that contain headers at multiple levels (e.g. as in HTML).
Simple descriptive metadata, such as title, description, author etc. can also be used to achieve rudimentary location and discovery services. In the absence of any textual content, this is probably the cheapest way of making location and discovery services available. In the presence of unstructured or semi-structured textual content, this metadata may be used to provide additional functionality to the location and discovery service, and to improve the performance of existing services, although whether the added value will outweigh the costs of metadata creation and curation depends on the nature of the collection and the particular needs of the end users.
If items refer to other items in the same collection (e.g. via hyperlinks), and these references can be reliably extracted by an automated process, then structure derived from these references can be exploited to drastically improve the performance of services based on analysis of textual content and/or services based on descriptive uncontrolled metadata. For example, Google ranks web sites based on the number of "inlinks", which are weighted by the rank of the referring site, and it is this ranking algorithm that is undoubtedly the primary factor in Google's dominance of the web search market. However, it should be noted that while Google provides an excellent location service, its utility as a discovery service is questionable, especially where search terms have multiple meanings.
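The principle of weighting inlinks by the rank of the referring item can be illustrated with a simplified power-iteration sketch in the spirit of the published PageRank algorithm. This is not a description of Google's production system. The damping factor of 0.85, the number of iterations and the treatment of items without outgoing links are all assumptions made for the example.

```python
def link_rank(links, damping=0.85, iterations=50):
    """Rank items so that a link from a highly ranked item counts for more."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / len(pages) for page in pages}
        for page in pages:
            # Items with no outgoing links share their rank with every item.
            targets = links.get(page) or pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical collection: three items refer to "c", and "c" refers to "a".
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
rank = link_rank(links)
best = max(rank, key=rank.get)
```

In this example "c" attracts the most references and is ranked highest, while "a" is ranked above "b" and "d" because its single inlink comes from the highest ranked item.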
As an aside, the extent to which the structure of a collection can be exploited to improve location and discovery services depends on the information content or entropy of the structure. I.e. it relies on some items receiving more references than others ("attractors"), and also on this unevenness being correlated with "quality" or "relevance". An interesting property of a system like Google and the Web is that it creates a tendency for feedback effects, whereby the strength of existing attractors tends to increase. I.e. if a web site has a high rank, it is more likely to be found, in which case it is more likely to receive references, which in turn increases its rank. This same property can also retard the discovery of newly added items, because a critical mass of references is required before a new item will gain a sufficiently high rank to appear within the topmost results.
If the behaviour of the end users of a collection can be captured and correlated, then this information can be used to provide an information discovery service, or to improve services achieved by other means. The simplest example of this strategy is Amazon's referral service (i.e. "people who bought X also bought Y and Z"). A more sophisticated example is provided by social bookmarking websites such as del.icio.us [Delicious]. Because users tend only to bookmark web sites they have visited and found useful, this provides a powerful quality filter. Also, because users may discover other users with similar interests via the bookmarks they share in common, a social network is established. Because anyone may view anyone else's bookmarks, users can exploit this social network to discover items within their domain of interest. This discovery process is refined by the use of "tags" to organise bookmark collections, allowing users to target specific subject areas within other users' collections (and of course locate items in their own collection).
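The simplest form of this strategy amounts to counting co-occurrences. The following sketch assumes that user behaviour has been captured as a list of item sets (purchases, bookmarks or similar); the data and function name are hypothetical.

```python
from collections import Counter
from itertools import permutations


def also_chosen(baskets):
    """For each item, count how often every other item was chosen with it."""
    co_occurrence = {}
    for basket in baskets:
        for x, y in permutations(set(basket), 2):
            co_occurrence.setdefault(x, Counter())[y] += 1
    return co_occurrence


# Hypothetical behaviour data: each list is one user's set of chosen items.
baskets = [["X", "Y", "Z"], ["X", "Y"], ["X", "W"]]
co_occurrence = also_chosen(baskets)
top_for_x = co_occurrence["X"].most_common(1)
```

Here the item most often chosen together with X is Y, which is the kind of suggestion a referral service of this type would surface first.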
The popularity of social bookmarking websites such as del.icio.us demonstrates the fundamental role of social interaction in the process of information discovery. If a discovery service can mediate social interaction between its users via their use of the collection, then the superior knowledge some users have of specific areas of the collection can travel rapidly throughout relevant parts of the user community. For collections where items do not reference each other (e.g. a catalogue of books), the structure created by this network of interactions is certainly invaluable especially for information discovery. For collections where items do reference each other (e.g. a collection of hypertext documents or a collection of scientific papers with citations), the growth of social bookmarking websites in the shadow of Google proves that the additional layer of structure derived from social interaction adds significant value to the layer of structure derived from item-to-item references. Reasons for this may be that the network of bookmarks is less easily exploited for selfish ends, and also facilitates much more rapid discovery of new items, because bookmarks are formed and therefore propagate faster than new references in content.
This is the context in which we must evaluate the utility and profitability of information discovery and location services derived from the use of controlled structured vocabularies (see also Appendix A. Example Use Case). Sophisticated statistical techniques for the analysis of textual content, and/or facilitated social interaction networks, can underpin fully automated services, which therefore incur minimal ongoing costs. For a controlled vocabulary to be profitable, the value it adds over and above that already provided by other means must outweigh the costs of manual creation and curation of vocabularies and indexes. Whether this is the case for any given collection or group of collections will depend on a number of factors related to the nature and scope of the items, the dynamics of the collection, and the specific needs of the end users.
However, let us assume that there are circumstances under which a controlled structured vocabulary can be "profitable" - perhaps because users of a particular collection have highly specific needs and therefore demand a minimum level of service that cannot be achieved by other means - and explore ways in which the costs of creation and maintenance of controlled structured vocabularies can be minimised.
@@TODO
Social tagging of websites and images provides a counterpoint to the traditional model of specialist vocabulary editors and indexers. They are at opposite ends of a continuum along several dimensions: large open development community versus small exclusive editorial team; meaning determined by use versus meaning determined by authority; totally decentralised collaboration versus totally centralised management; no change control versus highly structured change control; rapid continuous change versus slow discrete change; ... @@TODO
This perspective has opened up the opportunity to explore the region between these two extremes, and in so doing find ways of reducing the cost and maximising the utility of manual indexing. We predict that a collaboration model somewhere between the two will become highly fruitful, where effort is decentralised but quality is improved via centralised editorial privileges ... @@TODO
This raises the question of what sorts of change management model will prove useful under these circumstances, and to what extent these models can be standardised, so that enough change information can be published in a structured way to support useful behaviours of information retrieval applications under change. ... @@TODO
@@TODO focus on information retrieval ...
If we accept a scope for SKOS defined around its use within information retrieval applications, are there any areas where our understanding of the required behaviour of those systems is yet to be developed?
Take for example a very standard piece of IR functionality: given a hierarchical vocabulary and a query node, retrieve results for that node and all its specialisations. This is so common a requirement that SKOS currently includes a rule for it, so that rule-based inference can be used to compute the closure. However, this does not allow for e.g. ranking based on closeness to the original query node. The approach can be extended to more advanced expansion features, but these run into other complexities, such as the observation that closeness of siblings is generally greater at greater depth ... This requires calculation of weights based on some algorithm ... query expansion in more general terms ... types of query, structure of a query in general terms ... We need to understand preferred expansion algorithms: which provide genuine utility, and under what circumstances?
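As a rough illustration of the expansion just described, the following Python sketch computes the set of specialisations of a query concept and ranks hits by their distance from it. The vocabulary, the index and the function names are all hypothetical, and plain dictionaries stand in for an RDF store; it is a sketch under those assumptions, not the rule that SKOS defines.

```python
from collections import deque

# Hypothetical toy data: concept -> its broader (parent) concepts,
# and a simple single-subject index of item -> subject.
BROADER = {
    "mammals": ["animals"],
    "cats": ["mammals"],
    "dogs": ["mammals"],
    "siamese": ["cats"],
}
INDEX = {"doc1": "animals", "doc2": "mammals", "doc3": "cats", "doc4": "siamese"}


def expand(query, broader):
    """Map the query concept and each of its specialisations to its path length."""
    narrower = {}
    for child, parents in broader.items():
        for parent in parents:
            narrower.setdefault(parent, []).append(child)
    distances = {query: 0}
    queue = deque([query])
    while queue:  # breadth-first, so the first distance found is the shortest
        concept = queue.popleft()
        for child in narrower.get(concept, []):
            if child not in distances:
                distances[child] = distances[concept] + 1
                queue.append(child)
    return distances


def search(query, broader, index):
    """Return (item, distance) pairs, closest matches first."""
    distances = expand(query, broader)
    hits = [
        (item, distances[subject])
        for item, subject in index.items()
        if subject in distances
    ]
    return sorted(hits, key=lambda hit: (hit[1], hit[0]))


results = search("mammals", BROADER, INDEX)
```

A pure closure would return the same items without the distances; keeping the path length is what makes ranking by closeness possible.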
Can this sort of algorithm be efficiently implemented using RDF technologies? What would the implementation strategies be?
In the above example algorithm, no use was made of the frequency with which vocabulary concepts are used in the index (although in the particular example given this would not have made any difference, as each concept was used exactly once). In textual indexes, document frequency (i.e. the number of documents which contain a given term) (??@@TODO is this the right one?) is usually part of the score calculation for any query (@@TODO for what reason?) (@@TODO ref). In controlled vocabulary indexing this is analogous to the number of items that have been asserted to have a given subject, and could be used to influence relevance.
In the above example algorithm, the type of index was deliberately chosen to be the simplest possible case (I called this a "simple single-subject index"). This type of index is typical of the use of a classification scheme (i.e. each item is assigned to one category only). However, in the typical case for the use of a thesaurus, each item may be asserted to have many subjects, which is entirely appropriate for most items (let us call this a "simple multiple-subject index"). A hybrid situation can be imagined, where it is possible to assert several subjects for any item, and to assert one of these as the primary (i.e. main, principal) subject (let us call this a "simple primary/secondary-subject index"). The most complex situation involves indexing along multiple "facets", i.e. using a number of different, disjoint vocabularies to assert e.g. the primary/secondary people, periods and locations that are the subjects of the resource (let us call this a "multi-faceted index").
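To make the analogy concrete, the sketch below counts the items asserted to have a subject and turns the count into a weight. It assumes the inverse document frequency form log(N / df) that is common in text retrieval, where terms found in many documents are weighted down because they discriminate poorly. Whether that is the right analogue for controlled vocabulary indexing is an open question; the data and function names are hypothetical.

```python
import math

# Hypothetical index of (item, subject) assertions.
INDEX = [
    ("doc1", "cats"),
    ("doc2", "cats"),
    ("doc3", "cats"),
    ("doc4", "siamese"),
]


def document_frequency(subject, index):
    """Number of distinct items asserted to have the given subject."""
    return len({item for item, s in index if s == subject})


def inverse_document_frequency(subject, index):
    """log(N / df): rarer subjects get a higher weight; unused subjects get 0."""
    total_items = len({item for item, _ in index})
    df = document_frequency(subject, index)
    return math.log(total_items / df) if df else 0.0
```

With this data a match on "siamese" would weigh more than a match on "cats", because fewer items carry that subject.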
If we allow an index to have multiple subjects for any item, then a number of additional complicating factors are introduced. Firstly, it then makes sense to allow complex queries, i.e. queries of the general form "+X1 ... +Xn Y1 ... Yn -Z1 ... -Zn" (matching items must have subjects X1 ... Xn, may have subjects Y1 ... Yn, and must not have subjects Z1 ... Zn). How then should we combine the relevance scores from matching multiple optional query components? Is a simple additive model sufficient? Does the order of components given in the query matter? If the index is a primary/secondary-subject index, then matches against primary subjects should score more highly than matches against secondary subjects, in which case how should this be factored into the overall calculation of relevance? How should document frequency be factored in? And finally, in the case of a distributed indexing framework like del.icio.us, how should indexing frequency (i.e. the number of times an indexing assertion has been made, analogous to term frequency in text indexes) be factored in?
The particular relevance degradation function used in the example above was chosen to give an asymptotic shape to the degradation of relevance as we move along a path away from a query concept; however, other general functions can be used (e.g. linear, sigmoid). Which of these functions provides the greatest utility in terms of retrieval behaviour, or does it not matter? Also, the algorithm above only computes a single relevance metric, whereas independent scores along multiple dimensions could be computed - would this be useful? Finally, the algorithm above takes no account of the depth at which the path steps are located with respect to the generalisation tree as a whole, although some studies have suggested that "semantic closeness" between siblings increases with depth and therefore should also be factored in (@@TODO ref).
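The query form above can be sketched in Python as follows. The sketch assumes the simplest answer to the questions just raised: a purely additive score in which every matched component counts as 1 and the order of components is ignored. It also assumes that a query with no required components must match at least one optional one. The index and function names are hypothetical.

```python
# Hypothetical multiple-subject index: item -> set of subjects.
INDEX = {
    "a": {"cats", "france"},
    "b": {"cats", "dogs"},
    "c": {"cats"},
    "d": {"dogs", "france"},
}


def parse_query(text):
    """Split '+X Y -Z' into required, optional and excluded subject sets."""
    required, optional, excluded = set(), set(), set()
    for token in text.split():
        if token.startswith("+"):
            required.add(token[1:])
        elif token.startswith("-"):
            excluded.add(token[1:])
        else:
            optional.add(token)
    return required, optional, excluded


def evaluate(text, index):
    """Return (item, score) pairs, best first, using a purely additive score."""
    required, optional, excluded = parse_query(text)
    results = []
    for item, subjects in index.items():
        if not required <= subjects or excluded & subjects:
            continue  # a required subject is missing, or an excluded one is present
        matched_optional = optional & subjects
        if not required and not matched_optional:
            continue  # assumption: a purely optional query must match something
        results.append((item, len(required) + len(matched_optional)))
    return sorted(results, key=lambda result: (-result[1], result[0]))


hits = evaluate("+cats france -dogs", INDEX)
```

Weighting primary over secondary subjects, or folding in document frequency, would mean replacing the constant 1 per match with a per-match weight.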
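For comparison, the three shapes mentioned above can be sketched as functions of the path length between the query concept and a matched concept. These are hypothetical forms with arbitrary default parameters, chosen only to show the difference in shape; the geometric decay is assumed here as one possible asymptotic function and is not necessarily the one used in the example.

```python
import math


def geometric(distance, factor=0.5):
    """Asymptotic decay: relevance shrinks by a constant factor per step."""
    return factor ** distance


def linear(distance, max_distance=4):
    """Linear decay: relevance reaches zero at max_distance and stays there."""
    return max(0.0, 1.0 - distance / max_distance)


def sigmoid(distance, midpoint=2.0, steepness=2.0):
    """Sigmoid decay: near 1 close to the query, 0.5 at the midpoint, then near 0."""
    return 1.0 / (1.0 + math.exp(steepness * (distance - midpoint)))


# Relevance at path lengths 0..5 under each function, for side-by-side comparison.
table = [(d, geometric(d), linear(d), sigmoid(d)) for d in range(6)]
```

The geometric form never reaches zero, the linear form imposes a hard cut-off, and the sigmoid keeps near neighbours almost fully relevant before dropping sharply.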
@@TODO Implications of the above for mapping and query evaluation/translation.
The point of this discussion is to highlight the areas where further research may be required, both theoretical and empirical. If SKOS is to support a suite of information retrieval functionalities, we need to know what those functionalities are, to what extent they are useful under different circumstances, and crucially what information must be represented in order to enable their efficient implementation.
As stated in the introduction, SKOS and the DCMI metadata terms are natural partners. The "subject" element of the DCMI element set is commonly referred to as a "bucket" - anything and everything gets put there. SKOS provides a simple, lightweight way of expressing how the values of the dc:subject field are being controlled, and how the chosen controlled vocabulary is structured. This in turn allows simple yet useful search and browse applications to be constructed, before we even go so far as to consider additional functionalities such as complex queries and query expansion.
The semantics of the dc:subject element and its recommended usage with controlled subject vocabularies are defined in terms of the DCMI abstract model (@@TODO ref). The abstract model provides a basis for consistent interpretation of the various encoding syntaxes, such as DC-XML. Expressing Dublin Core metadata in terms of RDF is commonly viewed as just another encoding syntax; however, this is a misconception. RDF has its own model and semantics, and therefore defines its own set of interpretation rules (@@TODO ref). In order to express Dublin Core metadata as an RDF graph, we therefore have to understand the mapping between the DCMI abstract model and the RDF model theory. Because SKOS is itself an application of RDF, its semantics are defined entirely in terms of the RDF model theory, and therefore in order to use SKOS and DCMI metadata terms in combination we need to be able to map DCMI metadata semantics into RDF.
This problem has serious implications for SKOS. (@@TODO Illustrate this problem somehow)
@@TODO SKOS and OWL, illustrate comparison of identity reasoning.
SKOS also provides a bridge between Dublin Core Metadata and the Web Ontology Language (OWL) [@@TODO ref]. At first glance SKOS seems to provide similar functionality to OWL, so what exactly is the relationship between these two languages?
The skos:Concept class is the fundamental resource type in
SKOS. This class is currently defined formally as a resource of type
rdfs:Class, and the natural language definition of this class
indicates that a resource of type skos:Concept is "an abstract idea or
notion; a unit of thought" [@@TODO ref]. This is a rather vague definition,
however, and a more pragmatic definition might be "a conceptual unit of a
controlled vocabulary" - i.e. a unit of a controlled vocabulary that has a
distinct meaning. A resource of type skos:Concept does not have
any "extension", i.e. it does not abstract a class of individuals.
The comparison with OWL is most notably revealed by considering the
handling of identity reasoning. In an OWL ontology, if two individuals (e.g.
two foaf:Persons) are deemed to be the same, then they may be
asserted to be so via the owl:sameAs property, which licenses a
particular set of inferences to be made. Essentially, the two individuals are
merged in the ontology - they become a single individual. However, if two
resources of type skos:Concept are deemed to have the same meaning, the
current specifications explicitly recommend that these nodes are not merged
in the graph, and that owl:sameAs is not used. The reason for
this is that, although a conceptual unit from a controlled vocabulary may
have the same meaning as another conceptual unit from another vocabulary,
this does not mean they are in fact the same resource.
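The practical effect of merging can be sketched in Python with triples held as plain tuples. The two vocabularies, their identifiers and the merge function are hypothetical, and rewriting one node as the other is only one common way of implementing the inferences that owl:sameAs licenses.

```python
# Hypothetical concepts from two vocabularies, each with its own label and scheme.
TRIPLES = {
    ("v1:cats", "skos:prefLabel", "Cats"),
    ("v1:cats", "skos:inScheme", "v1:scheme"),
    ("v2:felines", "skos:prefLabel", "Felines"),
    ("v2:felines", "skos:inScheme", "v2:scheme"),
}


def merge_same_as(triples, keep, drop):
    """Rewrite every occurrence of `drop` as `keep`, merging the two nodes."""

    def swap(node):
        return keep if node == drop else node

    return {(swap(s), p, swap(o)) for s, p, o in triples}


merged = merge_same_as(TRIPLES, "v1:cats", "v2:felines")
labels = {o for s, p, o in merged if p == "skos:prefLabel"}
schemes = {o for s, p, o in merged if p == "skos:inScheme"}
```

After the merge a single node carries both preferred labels and belongs to both schemes, so it is no longer possible to tell which label came from which vocabulary. Keeping the nodes separate and linking them with a mapping statement avoids this.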
Two options exist for handling identity (i.e. sameness) in SKOS.
The first is to allow an assertion of
So how do we handle the situation where an ontology and a controlled structured vocabulary represented in RDF using SKOS cover the same subject domain? ... USE CASE (flickr?) !!! @@TODO
Illustrates that, in order for SKOS and OWL to be used effectively in combination, we need to understand the low level relationships between them.
@@TODO something about using OWL to define SKOS extensions? Hybrid SKOS/OWL vocabularies.
@@TODO new features, bugs etc.
Current issues ...
Representation of thesauri in RDF using SKOS ...
A major use for SKOS is the RDF representation and Semantic Web publication of thesauri that conform more-or-less to the ISO 2788 standard (@@TODO ref). However, several design issues require a resolution here.
A significant drawback of current thesaurus-based retrieval systems is that the lexical value of a descriptor is used both as a label in visual interfaces, and as a means of reference in databases or other information systems. Such retrieval systems cannot respond to changes of meaning in the thesaurus, because they have no way of referring to the current meaning associated with a descriptor (i.e. they cannot differentiate between the different meanings that may be associated with a descriptor over a period of time). The consequence of this is that, in an index that has been developed over a period of time, the way a searcher uses a descriptor in the present may differ significantly from the way the indexer applied the same descriptor at some point in the past. The only way of responding to changes of meaning in the thesaurus is for the indexers to manually re-evaluate the index assertions for any item that uses a descriptor with a changed meaning. This is obviously a costly and time-consuming task.
An RDF/SKOS representation requires the allocation of unique identifiers (URIs) to the distinct meaningful units of a thesaurus. By encouraging a separation between the labels presented to searchers and the means of reference (i.e. the identifiers) used by vocabulary administrators and within computer systems, retrieval applications may respond to change in a meaningful way, and crucially can either fully or at least partially automate the task of updating an index.
To illustrate this point, consider the following example. A controlled vocabulary is initially created in 1997 with two conceptual units as follows:
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
@prefix ex: <http://www.example.com/vocab#>.

ex:A1_a rdf:type skos:Concept;
  skos:prefLabel "Tony Blair"@en;
  skos:altLabel "UK Prime Minister"@en.

ex:A2_a rdf:type skos:Concept;
  skos:prefLabel "Gordon Brown"@en;
  skos:altLabel "UK Chancellor of the Exchequer"@en.
This vocabulary is then used to index documents between 1997 and 2006. For example, an indexer indexing the Downing Street web site selects "Tony Blair" from their visual interface, which causes the following assertion to be recorded in the index:
<http://www.pm.gov.uk> skos:subject ex:A1_a.
In 2006, the controlled vocabulary is modified as follows:
ex:A1_b rdf:type skos:Concept;
  skos:prefLabel "Tony Blair"@en.

ex:A2_b rdf:type skos:Concept;
  skos:prefLabel "Gordon Brown"@en;
  skos:altLabel "UK Prime Minister"@en.
... and the search interfaces are updated to use the new vocabulary. Now, if a searcher selects "Tony Blair" in some way from their visual interface, they will not receive the Downing Street web site in their result set, even though no changes have been made to the index. This is because the search causes the following query to be applied to the index:
?x skos:subject ex:A1_b.
... which of course yields no bindings for the variable x if
directly evaluated against the index.
A "mapping graph" is also created at the time of the change, and consists of the following statements:
ex:A1_a skos:related ex:A1_b.
ex:A2_a skos:related ex:A2_b.
This mapping graph is published separately from the vocabulary itself, and may be used by search applications to respond to the changed meaning in useful ways. For example, if the application doesn't consider the change in meaning to be significant, it could use the mappings to translate queries on the fly, and then evaluate translated queries against the original index. Or, if the application considers the change in meaning to be significant but small, it could merge the mapping graph with the structure graph for the vocabulary, and directly evaluate queries against the index using an expansion algorithm similar to the one described in section 5. Then, a searcher selecting "Tony Blair" would still receive the Downing Street website in their result set, but it would be ranked lower than other resources about Tony Blair that had been indexed since the change. The interesting point to note here is that the use of an expansion algorithm such as the one described above removes the need to translate queries, and neither strategy requires any amendment to the index.
In the scenario described above, when viewed as a thesaurus, the descriptors "Tony Blair" and "Gordon Brown" have changed in meaning, and in traditional retrieval systems there would be no way of responding to this change other than by manual re-indexing. Because of the separation of concerns that SKOS encourages, this change in meaning may be made available to retrieval applications in a way that allows them to handle the change in a variety of ways.
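The two strategies can be sketched in Python, with tuples standing in for the index and the mapping graph. The item http://example.org/new-article, the penalty of 0.5 and the function names are hypothetical, and skos:related is treated as symmetric; this is a sketch under those assumptions rather than a prescribed algorithm.

```python
# Index as it stood before the change, and after one new item was indexed.
ORIGINAL_INDEX = [("http://www.pm.gov.uk", "ex:A1_a")]
CURRENT_INDEX = ORIGINAL_INDEX + [("http://example.org/new-article", "ex:A1_b")]

# The mapping graph: pairs linked by skos:related.
MAPPING = [("ex:A1_a", "ex:A1_b"), ("ex:A2_a", "ex:A2_b")]


def related(concept, mapping):
    """Concepts linked to `concept` in either direction."""
    linked = set()
    for subject, obj in mapping:
        if subject == concept:
            linked.add(obj)
        if obj == concept:
            linked.add(subject)
    return linked


def search_translated(concept, index, mapping):
    """Strategy 1: translate the query, then evaluate it against the original index."""
    targets = related(concept, mapping)
    return sorted(item for item, subject in index if subject in targets)


def search_expanded(concept, index, mapping, penalty=0.5):
    """Strategy 2: expand over the mapping, scoring mapped matches lower."""
    linked = related(concept, mapping)
    scores = {}
    for item, subject in index:
        if subject == concept:
            score = 1.0
        elif subject in linked:
            score = penalty
        else:
            continue
        scores[item] = max(score, scores.get(item, 0.0))
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))


translated = search_translated("ex:A1_b", ORIGINAL_INDEX, MAPPING)
expanded = search_expanded("ex:A1_b", CURRENT_INDEX, MAPPING)
```

In both cases the index itself is left untouched; only the query side consults the mapping graph.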
However, the simplicity of the SKOS representation, and the details of the mapping from a thesaurus to a SKOS representation, create some problems.
A serious consideration for SKOS is whether the possibility of making a round-trip from a traditional thesaurus representation to SKOS and back again without loss of information is a fundamental requirement for SKOS. If not, then the workaround above is not needed.
@@TODO move much of the following to the "Society" section.
The problem above however does not affect information retrieval functionality. It only affects the ability to produce certain types of visualisation (in this case a traditional thesaurus representation). However, this difference in underlying structure between SKOS/RDF and a traditional thesaurus leads to a more serious problem that does affect information retrieval functionality.
Representation of classification schemes in RDF using SKOS ... meaning of skos:broader - is it appropriate to use it for class division that is really coordination?
There is also an issue with expressing labelling semantics: are the implied semantics OK?
Mention issue of OWL integration & import?
Needed features ...
1. Support for coordination of conceptual units. Benefits - allow coordinations within complex queries, differentiate between a complex query and a coordination. Possible solution.
2. Mapping simplification ... at least provide computational basis for implementation, then explore costs/benefit tradeoffs through tool development. Mapping definitely in scope, mapping for retrieval.
3. Extensibility ... already draft, needed within main spec.
@@TODO
--- Change Log ---
$Log: paper.html,v $
Revision 1.30 2006/07/31 12:50:49 ajm65 Minor spelling.
Revision 1.29 2006/07/31 12:38:03 ajm65 Added deprecation link.
Revision 1.28 2006/05/07 17:29:13 ajm65 Rewrote much of section Society, basic framework now there, much text in final draft shape, still TODO discussion of collaboration and change management models.
Revision 1.27 2006/05/05 16:53:59 ajm65 Firm beginning to society section, still lots TODO.
Revision 1.26 2006/05/05 14:38:29 ajm65 Some new content in society section, still very rough.
Revision 1.24 2006/05/04 15:37:27 ajm65 Some refinement of section Architecture.
Revision 1.22 2006/05/04 10:17:26 ajm65 Very minor change.
Revision 1.20 2006/05/03 16:13:37 ajm65 Edits to introduction and to scope sections.
Revision 1.19 2006/05/02 15:26:31 ajm65 Removed thesaurus representation stuff to appendix.
Revision 1.18 2006/04/30 12:27:19 ajm65 Moved text out to DCAM and OWL compatibility sections
Revision 1.17 2006/04/29 14:27:36 ajm65 Refactored, much moved to appendixes as separate files.
Revision 1.16 2006/04/28 18:48:30 ajm65 Added some discussion of change issues, currently in features section but probably should be moved to society.
Revision 1.15 2006/04/27 18:39:11 ajm65 Developed some discussion in the Features section around RDF representation of thesauri.
Revision 1.14 2006/04/27 13:04:38 ajm65 Minor edit to document header.
Revision 1.13 2006/04/26 17:43:33 ajm65 Added outline for Features section. Added first draft of abstract.
Revision 1.12 2006/04/26 16:27:57 ajm65 Added very rough draft of OWL discussion to partnership section.
Revision 1.11 2006/04/26 10:40:14 ajm65 Rough draft of RDF-DCMI-SKOS mapping problems in partnership section.
Revision 1.10 2006/04/26 09:23:53 ajm65 Finished first draft of research section.
Revision 1.9 2006/04/25 19:40:17 ajm65 Finished draft of example expansion algorithm
Revision 1.8 2006/04/25 19:15:41 ajm65 Rough draft of Society section, some more notes in research section.
Revision 1.7 2006/04/24 16:33:25 ajm65 Almost completed first draft of scope section.
Revision 1.6 2006/04/24 12:35:04 ajm65 Finished first draft of section 'Architecture'.
Revision 1.5 2006/04/22 12:48:47 ajm65 Added some more notes on architecture.
Revision 1.4 2006/04/22 01:50:48 ajm65 Added some discussion of software architecture.
Revision 1.3 2006/04/21 16:19:27 ajm65 Added some more notes on scope.
Revision 1.2 2006/04/21 15:50:11 ajm65 Added section titles, some notes on scope and a few notes on architecture and society.
Revision 1.1 2006/04/21 15:03:40 ajm65 Initial revision. Introduction given with basic layout of document structure and goals.
---