Friday, December 02, 2005

Identifiers and Subject Access

A while back I posted a criticism of David Weinberger's piece in the Boston Globe. He was kind enough to respond. Since many folks might miss the comments, I'm reposting them here.
Here's what I was trying to say, in a highly-compressed article.

Of course subject headings let us classify objects in more than one way. But the number of subject headings under which an object can fall is limited by the physical constraints of card catalogs and books. Further, the physical world requires us to shelve books in one spot and not another. (Multiple copies can be shelved in multiple spots, but that gets messy fast.) So, if we want a collection through which users can roam, we have to make a decision about the primary subject area within which the book will be physically shelved, and then a limited number of other subheadings under which it can be classified (with some number of see-also's). The limit (ten for the LoC, for example) is based not on the number of subject headings that might be relevant but on the awkwardness of physical material.

Digitizing the content as well as the metadata not only removes the limitation, it also allows for richer ways of identifying books one might want to read. Subjects, author and title are obvious ways we want to find books, but there are many more relationships that are useful for locating books we know or don't yet know we want to read. Cf. Amazon for a commercially-inspired -- and plain old inspired -- example.

But, to enable these richer ways of finding books, we need identifiers. IMO (and it's an uncertain opinion), semantics-free global unique IDs are the best choice. The minimal semantics and prevalence of ISBNs make them a good candidate, although there are some obvious problems with them (e.g., they only started in the 1960s). In any case, there's no reason to stick with a single set of GUIDs because computers are good at coordinating multiple sets of related data. So bring on the multiple ID schemes! (I hope Google Print publishes whatever IDs it's using internally.)

That's what my piece in the Globe intended to say. If it led readers to a different understanding, then I wrote it badly.
Libraries provide many more access points than authors, titles and subjects. Format, genre, geographic codes, publisher numbers, time codes, keywords, and dates of publication or content all spring readily to mind. The bibliographic record in a library catalog is a very rich source of metadata. How easy it is to access that richness is another story. Collocation by many different facets is possible with the current metadata. Users can roam through the search results as easily as through digital collections.

Due to concerns about patron privacy we have not implemented recommendation systems. I think we could do so and still protect an individual's personal data. I think we will move in that direction in the next few years.

Identifiers are a problem. There will, as you suggest, have to be many. There already are. Many records in a library catalog will contain an ISBN, EAN and UPC. Many other standard identifiers can be included in a bibliographic record.
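
The ISBN and the EAN on a book are closely related: the 13-digit EAN form is the 978 prefix plus the first nine digits of the 10-digit ISBN, with a new check digit. A minimal sketch of that conversion follows. The function name is my own, and the ISBN in the usage note is only a sample value chosen for illustration.

```python
def isbn10_to_ean13(isbn10: str) -> str:
    """Convert a 10-character ISBN to its 13-digit EAN form (978 prefix)."""
    # Strip the hyphens and spaces that printed ISBNs usually carry.
    chars = isbn10.replace("-", "").replace(" ", "")
    if len(chars) != 10:
        raise ValueError("expected a 10-character ISBN")

    # Drop the old check character and add the 978 prefix.
    core = "978" + chars[:9]
    if not core.isdigit():
        raise ValueError("the first nine characters must be digits")

    # EAN-13 check digit: weights alternate 1, 3 over the first 12 digits.
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(core))
    check = (10 - total % 10) % 10
    return core + str(check)
```

For example, isbn10_to_ean13("0-306-40615-2") returns "9780306406157". The function does not verify the incoming ISBN-10 check character, so a real loader would want that step as well.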

A greater problem is what the identifiers identify. If I'm looking for Hamlet, do I want a particular format or edition? Would a book on CD, a large-print edition, or a film do, or do I require the Everyman's edition with a particular introduction? ISBNs are acceptable for identifying a particular manifestation. Searching for an expression or for all manifestations of a work is a problem. OCLC has the xISBN service, which collects all other ISBNs for a work and allows searching by all of them. That helps somewhat, but it is not a good long-term solution. Librarians are working on an identifier for works. Parts of a work will also need identifiers; maybe standard citations would work. The OpenURL is a possible solution since it uses citation data. The Functional Requirements for Bibliographic Records (FRBR) will be useful in pulling together all the different manifestations of a work and differentiating among them.
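
To make the work / expression / manifestation distinction concrete, here is a rough sketch of how records might be grouped. It is not how xISBN or any FRBR implementation actually works. The class, the function names and the "id-00x" identifiers are all placeholders I made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Manifestation:
    identifier: str  # placeholder; in practice an ISBN or similar
    work: str        # the abstract work, e.g. "Hamlet"
    expression: str  # a particular realisation of the work
    form: str        # the carrier: print, large print, CD, ...


def group_by_work(items):
    """Return {work: {expression: [manifestations]}}."""
    works = defaultdict(lambda: defaultdict(list))
    for m in items:
        works[m.work][m.expression].append(m)
    return works


def identifiers_for_work(items, work):
    """Collect every identifier attached to any manifestation of a work."""
    return sorted(m.identifier for m in items if m.work == work)


# Hypothetical sample records, not real catalog data.
sample = [
    Manifestation("id-001", "Hamlet", "English text", "print"),
    Manifestation("id-002", "Hamlet", "English text", "large print"),
    Manifestation("id-003", "Hamlet", "Audio reading", "CD"),
    Manifestation("id-004", "Macbeth", "English text", "print"),
]

grouped = group_by_work(sample)
```

With this shape, a search for the work "Hamlet" can return all three placeholder identifiers, while a patron who needs large print can still pick out the single manifestation that fits.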

Folksonomies, trackbacks, and readers' comments will all enrich access to materials in the library (either physical or digital) in the not too distant future. RSS allows distribution of new item lists and other information from libraries. This is already being done and will become more widespread.

Thursday, December 01, 2005

Cataloging Aerial Photographs

A Digital Archive of Illinois Historical Aerial Photographs (ILHAP) by Arlyn Booth describes making this collection available. Part of the paper deals with MARC cataloging and Dublin Core metadata.

OLAC Newsletter

I've just received my print copy of the OLAC Newsletter. OLAC members should be getting their copies soon, if they haven't received them yet. If you are not a member, why not? 2006 is a conference year, an excellent time to join.

Wednesday, November 30, 2005

Searching Repositories

OJAX is an open-source tool that provides a highly dynamic AJAX-based user interface to a federated search service for OAI-PMH-compatible repository metadata.

OJAX is simple, non-threatening but powerful. It attempts to minimise upfront user investment and provide immediate dynamic feedback, thus encouraging experimentation and enabling enactive learning.
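
For readers who have not seen OAI-PMH, the metadata behind a tool like this is gathered with plain HTTP requests. The sketch below only builds a ListRecords request URL; it says nothing about how OJAX itself is implemented. The function name and the repository address are hypothetical, while the parameter names (verb, metadataPrefix, set, from) come from the protocol.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc",
                     set_spec=None, from_date=None):
    """Build an OAI-PMH ListRecords request URL for a repository."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec    # optional: restrict to one set
    if from_date:
        params["from"] = from_date  # optional: selective harvesting by date
    return base_url + "?" + urlencode(params)


# Hypothetical repository address, used only for illustration.
url = list_records_url("http://repository.example.org/oai",
                       from_date="2005-11-01")
```

The repository answers with XML records, by default in unqualified Dublin Core (oai_dc), which a harvester would then index for searching.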

SKOS Documents from the W3C

The W3C Semantic Web Best Practices and Deployment Working Group has announced the publication of the following technical reports as second W3C Public Working Drafts:
A summary of revisions since the first Working Draft publication is available.
SKOS Core is a simple, flexible and extensible language for expressing in a machine-understandable form the structure and content of concept schemes such as thesauri, classification schemes, subject heading lists, taxonomies, 'folksonomies', other types of controlled vocabulary, and also concept schemes embedded in glossaries and terminologies.
I thought folksonomies were uncontrolled. Maybe they should not be in the description.

Textbooks Online

Here is a worthwhile effort:
Welcome to Textbook Revolution, the web's source for free educational materials.

In response to the textbook industry's constant drive to maximize profits instead of educational value, I have started this collection of the existing free textbooks and educational tools available online. This website has several reasons for being:

  • To serve as a catalog of resources for students and teachers looking for free textbooks (one-stop shopping)
  • To act as a mirror for files. Mirrors help reduce bandwidth costs and prevent files from disappearing if a website goes out of business.
  • To promote the need for and availability of free textbooks.
Please look around and enjoy the site. I'll be adding books and links as fast as I can. If you have something you'd like to contribute, please email submissions at textbookrevolution dot org

Tuesday, November 29, 2005


At my recent talk I pointed folks to my FURL site for links to all the tools I discussed. I've just received a note from a school librarian saying the filter will not allow access to that site. That is so wrong. Is Google or Yahoo filtered? FURL and Delicious and Technorati and the rest are another method of finding information on the Web. Why can't a teacher or librarian bypass the filter? Commercial filtering programs also have hidden biases. A better option would be to go with an open source product like DansGuardian.

Spelling Catalog

Catalog or Catalogue?: Examining a Library Dilemma by Beall, Jeffrey (2004).
The variant spellings catalog and catalogue create problems for librarianship by causing confusion, hindering research, and betraying the standardization the profession values. The predominant spelling in Britain (catalogue) differs from the predominant spelling in the U.S. (catalog), but within the U.S. both spellings are commonly used. Both of these different practices create inconsistencies. Although the spelling catalog has long been prescribed in the U.S., it has not fully caught on. The spelling catalog is far more common on the Web than catalogue. The best solution to this dilemma for librarians may be to not use this outmoded term at all.