Thursday, November 17, 2005

Classification and the Web

"Crunching the metadata: What Google Print really tells us about the future of books" by David Weinberger appears in the Nov. 13 edition of the Boston Globe. It is wrong on so many counts. He seems to confuse classification with identification; in many libraries the call number is not unique. He seems to ignore subject headings: a book of bird paintings can have subject access for both the artistic and the avian content. There is no reason we should not be able to apply multiple call numbers in 050 and 082, since the "mark and park" is taken care of in the copies part of the record. Even now most records have both LC and Dewey, and some have USGS, NASA, NAL, or NLM classifications as well.
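
To make the 050/082 point concrete, here is a rough sketch in Python. It is my own simplified model, not real MARC and not any library system's API: the `Record` class, its method names, and every call number and heading in it are made-up placeholders. The only things taken from the format are the tag meanings (050 for LC call numbers, 082 for Dewey numbers, 650 for topical subject headings). It shows one record carrying several classifications and several subjects, while the shelving decision lives separately with each copy.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """Simplified stand-in for a bibliographic record (not real MARC)."""
    title: str
    # tag -> list of values; a tag may repeat, e.g. two 050 fields
    fields: dict[str, list[str]] = field(default_factory=dict)
    # shelving location of each physical copy (the "mark and park" part)
    copies: list[str] = field(default_factory=list)

    def add(self, tag: str, value: str) -> None:
        """Append a value under a tag, allowing the tag to repeat."""
        self.fields.setdefault(tag, []).append(value)

    def classifications(self) -> list[str]:
        """All LC (050) and Dewey (082) numbers on the record."""
        return self.fields.get("050", []) + self.fields.get("082", [])

    def subjects(self) -> list[str]:
        """All topical subject headings (650) on the record."""
        return self.fields.get("650", [])


# Placeholder values only; none of these come from a real catalog record.
book = Record(title="A Book of Bird Paintings")
book.add("050", "LC-art-placeholder")
book.add("050", "LC-ornithology-placeholder")
book.add("082", "Dewey-art-placeholder")
book.add("650", "Birds in art")
book.add("650", "Birds")

# The copy is shelved in one place, but the record is findable under all of them.
book.copies.append("LC-art-placeholder")

print(book.classifications())
print(book.subjects())
print(book.copies)
```

The record answers to three classification numbers and two subject headings, yet the single physical copy still sits in exactly one spot. Classification and shelf location are separate pieces of data.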


Bob Doyle said...

Hi David,

I think David Weinberger knows more than a single identifier is needed when he says "we're going to need massive collections of metadata about each book."

But I agree he does stress "the one call number" too strongly.

Bob Doyle

Ed said...

Preach it!

The sad thing is that this lack of knowledge about subject tracings and how useful they are is very common.

I attended a NEASIST event here in Boston, and both of the non-librarian speakers made a moderately big deal about how "books could only be arranged one way."


I think this says as much about the non-usability of current online catalogs as anything...

David said...

Here's what I was trying to say, in a highly compressed article.

Of course subject headings let us classify objects in more than one way. But the number of subject headings under which an object can fall is limited by the physical constraints of card catalogs and books. Further, the physical world requires us to shelve books in one spot and not another. (Multiple copies can be shelved in multiple spots, but that gets messy fast.) So, if we want a collection through which users can roam, we have to make a decision about the primary subject area within which the book will be physically shelved, and then a limited number of other subheadings under which it can be classified (with some number of see-also's). The limit (ten for the LoC, for example) is based not on the number of subject headings that might be relevant but on the awkwardness of physical material.

Digitizing the content as well as the metadata not only removes the limitation, it also allows for richer ways of identifying books one might want to read. Subjects, author and title are obvious ways we want to find books, but there are many more relationships that are useful for locating books we know or don't yet know we want to read. Cf. Amazon for a commercially-inspired -- and plain old inspired -- example.

But, to enable these richer ways of finding books, we need identifiers. IMO (and it's an uncertain opinion), semantics-free globally unique IDs are the best choice. The minimal semantics and prevalence of ISBNs make them a good candidate, although there are some obvious problems with them (e.g., they only started in the 1960s). In any case, there's no reason to stick with a single set of GUIDs because computers are good at coordinating multiple sets of related data. So bring on the multiple ID schemes! (I hope Google Print publishes whatever IDs it's using internally.)

That's what my piece in the Globe intended to say. If it led readers to a different understanding, then I wrote it badly.