18th March 2004
Even in the specialist Information Retrieval (IR) community, there is a certain amount of confusion about data, metadata, the difference between them, which data is of what kind and so on. There's even more confusion outside the IR community as people try to figure out what the IR people are talking about. And related to these issues, there are problems when it comes to clearly expressing concepts such as authorship and language.
As always, a good example is worth any amount of theoretical exposition, so here is one that touches on most of the issues.
Consider three things:
We have plenty of scope for confusion here: two chunks of data, three languages and some administrative processes. Let's look at them in a little more detail.
A resource is defined in the IR world to mean a thing that exists outside of the IR system itself, but is described by it. In fact, it's pretty much the case that IR systems exist solely in order to help people locate resources.
The usual example of a resource is a book: its full text is rarely available on-line, so that anyone who wants it needs to go to a library and borrow a copy, or buy one from Amazon or something. Other examples of resources might include CDs, objects held in a museum collection, web-sites and historic monuments.
The content of the resource is often referred to as data - that is, the actual words of the book, the bit-stream of CD-audio, or (rather less obviously) the vase or monument itself.
A resource is represented in an IR database by a record - an assemblage of data about the resource itself. The record stands as a proxy of the resource for searching purposes; and when retrieved it contains information about the resource, often enabling the resource itself to be located (e.g. giving the ISBN of a book, a URL for a web-site or the latitude and longitude of a monument.)
A record for A la Recherche du Temps Perdu might look like this, then:
This data about a resource is usually referred to as metadata, in contrast to the data itself which is the content of the book that the record describes.
But wait! This record is itself a resource: that is, it's a thing that's out there, that people can refer to. It too has certain attributes: no title, perhaps, but certainly an author and a date of publication; and the record is written in a specific language.
As it happens, I just made up the record, so a record about the record might look like this:
This is data about metadata - meta-metadata, if you like. Another way of thinking about it is that this is the metadata for the record we sketched in the previous section, and the record is itself data. In other words, one man's data is another man's metadata.
In theory, you could store this meta-metadata record, with many others like it, in their own database (or meta-database if you like to be pedantic about such things). That approach, though, is both cumbersome in practice and theoretically troubling, because you then need a meta-meta-metadata record to describe the authorship and language of the meta-metadata; and so on.
So what we usually do instead is store this data inside the record itself. Then we can call it record-data - not a standard term, but one introduced in this document. Record-data elements look like this:
Then we can just throw the elements from exhibits 1 and 3 together into a single record which describes both itself and the original resource.
Take a moment to appreciate what's happened here. A single record has two author fields, Proust and Taylor: one is the author of the resource, and one is the author of the record. Likewise, the resource and the record have different publishing dates. And while the original resource is written in French, the record that describes it is in English.
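To make the shape of such a combined record concrete, here is a minimal sketch in Python. The class and field names are hypothetical, chosen only for this illustration and not taken from any standard; the dates are left empty rather than invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CombinedRecord:
    """One record carrying both metadata and record-data (hypothetical field names)."""
    # Metadata: describes the resource, i.e. the book itself.
    title: str
    author: str
    language: str
    date: Optional[str] = None           # publication date of the resource
    # Record-data: describes this record, not the book.
    record_author: Optional[str] = None
    record_language: Optional[str] = None
    record_date: Optional[str] = None    # creation date of the record


record = CombinedRecord(
    title="A la Recherche du Temps Perdu",
    author="Proust",
    language="French",
    record_author="Taylor",
    record_language="English",
)

print(record.author, "wrote the book;", record.record_author, "wrote the record")
```

Keeping the two groups of fields under distinct names is the whole point: nothing stops a single record from having two authors, two dates and two languages, so long as each is labelled with what it describes.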
Now suppose someone wants to search the database of metadata records such as the one described in the previous section. A query like
author=Proust
is useful: it allows you to find the records for all the books written by Proust. But the query
recordAuthor=Taylor
is also useful: it allows you to find out which records in the database are my work; so that if, for example, I happen to be your teacher in the art of cataloging fiction, you can search for the catalog records I've made during the past year and use them as a model to emulate. (This happens to me all the time.)
Similarly, compare the pair of searches:
date > 1910
recordDate > 2002-12-01
The former finds the records for books written since 1910; the latter finds books whose records were created since the 1st of December 2002. Very different, but both useful. (The latter could be used to implement a ``what's new'' facility, for example.)
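The difference between the two kinds of search can be sketched with a toy in-memory database. This is only an illustration: the `search` helper is hypothetical, both record dates are invented, and the whole second record is invented so that the searches have something to disagree about.

```python
import operator

records = [
    {
        "title": "A la Recherche du Temps Perdu",
        "author": "Proust",
        "date": 1913,                  # year the first volume appeared
        "recordAuthor": "Taylor",
        "recordDate": "2003-01-15",    # invented for illustration
    },
    {   # entirely invented second record: an old book, recently cataloged
        "title": "An Older Book",
        "author": "Someone Else",
        "date": 1850,
        "recordAuthor": "Another Cataloger",
        "recordDate": "2003-02-01",
    },
]


def search(records, index, relation, value):
    """Return the records whose field `index` stands in `relation` to `value`."""
    return [r for r in records if index in r and relation(r[index], value)]


# Books written since 1910 (ISO dates as strings compare correctly, too).
recent_books = search(records, "date", operator.gt, 1910)
# Books whose records were created since 1st December 2002.
recent_records = search(records, "recordDate", operator.gt, "2002-12-01")
# Searching the wrong index finds nothing: Taylor wrote a record, not a book.
books_by_taylor = search(records, "author", operator.eq, "Taylor")

print([r["title"] for r in recent_books])
print([r["title"] for r in recent_records])
print(books_by_taylor)
```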
The same dichotomy exists between language (of the book) and recordLanguage, of course; but in this case there is yet a third language to take into account: the language of the search-term itself.
Suppose the database's search engine is enhanced by a multilingual thesaurus: then a search for the English word ``remembrance'' or the German word ``Erinnerung'' might also find the Proust novel, since both are broadly equivalent to the French word ``recherche''.
This is all very helpful, but such facilities must be applied with care: consider that the French word ``addition'' means the bill to be paid at the end of a meal. If you search for the English word ``addition'', you only want records about adding things up - you don't want the search engine to assume that it can also find records about post-meal bills. It should only do that if it knows that the search-term is French.
So in the most general case, a query needs to be able to specify the language of its search-terms. (The details of how this is specified will vary between query languages, of course: don't get hung up on the syntax of the example; we're trying to make a semantic point here.) For example:
title=Erinnerung WHERE termLanguage=German
This query, then, might find the Temps Perdu record that began this example.
What's happened here? We have three languages involved, all at once. The resource is a book written in French; the record that describes it is in English; and the query that's used to locate it has its terms in German.
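Here is a rough sketch of why the term language matters when a thesaurus is involved. Everything in it is an assumption made for illustration: the tiny `THESAURUS` table, the concept labels and the function names are invented, and a real multilingual thesaurus would be far richer.

```python
# (language, term) -> language-neutral concept label (all labels hypothetical)
THESAURUS = {
    ("English", "remembrance"): "memory",
    ("German", "Erinnerung"): "memory",
    ("French", "recherche"): "memory",      # following the rough equivalence above
    ("English", "addition"): "arithmetic-sum",
    ("French", "addition"): "restaurant-bill",
    ("English", "bill"): "restaurant-bill",
}


def expand(term, term_language):
    """Return every (language, term) pair sharing a concept with the given term."""
    concept = THESAURUS.get((term_language, term))
    if concept is None:
        # Unknown term: search for it literally, with no expansion.
        return {(term_language, term)}
    return {key for key, c in THESAURUS.items() if c == concept}


def title_matches(title, term, term_language):
    """True if any word of the title is an equivalent of the search-term."""
    words = {w.lower() for w in title.split()}
    candidates = {t.lower() for (_, t) in expand(term, term_language)}
    return bool(words & candidates)


# title=Erinnerung WHERE termLanguage=German
print(title_matches("A la Recherche du Temps Perdu", "Erinnerung", "German"))
# The same spelling expands differently depending on its declared language.
print(expand("addition", "English"))
print(expand("addition", "French"))
```

Because the lookup key is the pair of language and term, the English ``addition'' never drags in records about restaurant bills; only a term declared to be French does.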
All of the previous section should be clear. If it's not, then it's because I've written it badly, and you should contact me and let me know.
But we now come to the much more vexed question of how all these different authors, publication dates and languages are represented under the various conventions in use in the contemporary IR landscape.
So far as I am able to determine, the Dublin Core elements are intended only for describing resources, and do not address what we have called record-data at all. This can be seen as a major omission in the Dublin Core model, and has resulted in some of the DC elements being widely misused: for example, the Dublin Core creator element is sometimes used to store the name of the metadata record's creator rather than that of the resource.
When I am installed as SUPREMOR, one of the first things I do will be to add record-data elements to the Dublin Core.
Traditional Z39.50 systems have mostly catered for searching by means of the all-things-to-all-men attribute set, BIB-1. (It started life as a bibliographic set, hence the name, but all sorts of junk has been dumped into it over the last few years.)
According to the BIB-1 semantics document, there are at least seven different flavours of author-related access point, including Personal name (1), Author (1003), Author-name personal (1004), Author-name corporate (1005) and Author-name conference (1006). But all of them seem to pertain to the resource author rather than the record author.
On the retrieval side, things are rather better: the generic record syntax (GRS-1) uses tags drawn from tag-sets G and M and there is a fairly clear distinction between the roles of the two tag-sets: tag-set G describes information about the resource (what is traditionally if misleadingly known as ``metadata'') and tag-set M describes information about the record that describes the resource. For example, tag-set G's element 2 (author) is the author of the resource, such as Marcel Proust; while tag-set M's element 27 (recordCreatedBy) is the author of the record describing the resource, such as Mike Taylor. Similarly, tag-set G has element 20 (language), that is, the language of the resource; while tag-set M has element 22 (languageOfRecord).
So traditional Z39.50 seems to provide the vocabulary for the records returned by servers to describe themselves as well as the resources for which they are proxies; but it does not allow for searching on the record-data fields.
The Z39.50 Attribute Architecture provides a coherent framework in which multiple attribute sets can be defined, giving a broad palette of possible searches in various domains. It can be seen as a response to the mess that the BIB-1 attribute set had become by the late 90s.
Within the Attribute Architecture, two interesting attribute sets are defined:
But the utility set goes further than this: it provides an additional attribute type, 2 (``language''), the value of which specifies the language of the search-term to which the attribute is attached - which, of course, may be different from both the language of the original resource and the language of the record that describes it.
So the Z39.50 attribute architecture is powerful and specific enough to express queries like:
language = French
AND
recordLanguage = English
AND
title=Erinnerung WHERE termLanguage=German
SRW is the emerging new Web Services protocol for information retrieval, and CQL is its associated query language. They seem to provide a clear model of the metadata/record-data distinction, for both retrieval and searching.
On the retrieval side, metadata is represented using the Dublin Core elements in an appropriate XML Schema; while record-data is represented using the new record-data schema. It is trivial to produce an XML Schema which imports both of these, so allowing metadata and record-data to be freely mixed within a single record.
On the searching side, search-terms in CQL may be associated with indexes, which indicate what parts of the records in the database should be matched against the terms. Indexes are drawn from context-sets, which are identified by URIs (and, like XML namespace URIs, are abbreviated to short names). Among the context-sets already defined are both a Dublin Core set (for metadata) and a Record-Data set. Indexes from multiple sets may be freely mixed within queries, so for example the following are valid CQL queries:
dc.creator = Proust
rec.createdBy = Taylor
dc.date > 1910
rec.lastModified > 2002-12-01
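Since indexes from both sets can be mixed, a client might assemble such queries programmatically. The following is a small sketch only: the helper names `cql_clause` and `cql_and` are hypothetical, the quoting rule is a simplifying assumption rather than a full implementation of CQL's grammar, and clauses are joined with CQL's boolean `and`.

```python
def cql_clause(index, relation, term):
    """Build one 'index relation term' clause, quoting the term when it seems necessary."""
    term = str(term)
    # Assumption: quote terms that are empty or contain whitespace or CQL punctuation.
    if term == "" or any(ch in term for ch in ' \t()=<>"/'):
        term = '"' + term.replace('"', '\\"') + '"'
    return f"{index} {relation} {term}"


def cql_and(*clauses):
    """Join clauses with boolean 'and', parenthesising each one."""
    return " and ".join(f"({c})" for c in clauses)


# One metadata clause and one record-data clause in a single query.
query = cql_and(
    cql_clause("dc.creator", "=", "Proust"),
    cql_clause("rec.createdBy", "=", "Taylor"),
)
print(query)
print(cql_clause("rec.lastModified", ">", "2002-12-01"))
print(cql_clause("dc.title", "=", "Temps Perdu"))
```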
One important wrinkle slightly spoils the picture, though: as currently defined (version 1.1), CQL does not seem to be capable of expressing the language of a search-term; or at least, there is no standard way to do so. This may be fixed in future versions.