Librariansplaining: The controlled vocabulary

Aug 05 2010 Published by under Uncategorized

Fellow Scientopian Christina Pikas posted an examination of Stack Overflow's motion toward a controlled tagging vocabulary. Toward the end, she made me grin:

Ok, one of my ongoing jokes is how CS keeps reinventing LIS (well indeed they’ve taken over the term “information science” in some places) – so now Stack Overflow has reinvented taxonomy (not quite a thesaurus though, right, because no BT or NT just UF and U, lol)

A lot of librarians, me not least, grumble "we told y'all so" when we see computer science reinventing our wheels. What this means, of course, is that librarians haven't done nearly good enough a job explaining our wheels.

This is what Book of Trogool's "Jargon" category is all about. I mean to rename it to "Librariansplaining" (I'm sure Zuska or Janet will explain the coinage, if it isn't obvious already) as soon as I can sort out how to do that without borking category links.

And now I'm going to librariansplain about controlled vocabularies, and explain Christina's in-joke. It may help you to read some of the earlier posts in this category first.

In those posts, I talked about how librarians divide up the world of knowledge into teensy-tiny slivers of "aboutness" in order to help lead you from one item of interest to another. One of the pieces of dividing up knowledge is naming the slivers. When you start doing that, as Christina noted, you run into some human-language problems really quickly:

  • Synonymy. Istanbul or Constantinople? It's our business, as well as the Turks'.
  • Homonymy. I say "bat." Do you say "Chiroptera" or "baseball"? And if librarians decide to use the word for the baseball apparatus, what should we do so that the Chiroptera-fanciers can find stuff they want?
  • Terminology change. Nobody calls it a "horseless carriage" any more. To make matters worse, the first name something new gets is often not the name that sticks. Social changes also loom large here; some of the cruft that can accumulate in a naming system is kyriarchical cruft.
  • Granularity. Knowledge is infinitely divisible. Naming systems have to decide at what level separate names are warranted. It can also help to indicate relationships up and down the granularity chain; for example, one could call "weblogs" a subcategory of "social software." Or not.

So when librarians "control a vocabulary," we come up with a naming system that avoids the above pitfalls as much as we can manage.

Various types of controlled vocabularies exist; I don't propose to describe them all here. Instead, I'll describe the type that Christina was referring to: the thesaurus. (No, not the synonym dictionary. This is different. Hang with me while I explain.)

Thesauri cope with granularity by establishing "broader-term" and "narrower-term" relationships between terms. So in an entry for "Social software" you might see "NT: weblog, wiki, social-networking service." Likewise, in a "Weblog" entry you may well see "BT: social software." This doesn't absolve the vocabulary-builder of the responsibility to choose the granularity of terms wisely, but it does help.
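
For the programmers in the audience, here is a minimal Python sketch of how BT/NT relationships could be stored and walked. The dictionary layout and the helper names (`narrower_terms`, `broader_chain`) are my own illustration, not any real thesaurus software, and "Software" is an invented top term that exists only to show a two-step chain. I've also assumed each term has exactly one broader term, which real thesauri don't always guarantee.

```python
# Hypothetical mini-thesaurus: each term maps to its single broader term (BT).
# "Software" is an invented top term, added only to show a two-step chain.
BROADER = {
    "Weblog": "Social software",
    "Wiki": "Social software",
    "Social-networking service": "Social software",
    "Social software": "Software",
}


def narrower_terms(term):
    """Return the NT list for a term by inverting the BT mapping."""
    return sorted(t for t, bt in BROADER.items() if bt == term)


def broader_chain(term):
    """Walk BT links upward from a term to the top of the hierarchy."""
    chain = []
    while term in BROADER:
        term = BROADER[term]
        chain.append(term)
    return chain


print(narrower_terms("Social software"))
print(broader_chain("Weblog"))
```

Storing only the BT direction and deriving NT from it keeps the two views from drifting out of sync, which is one reasonable design choice among several.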

Homonymy and synonymy are often dealt with via "use" and "use for" relationships. If a vocabulary-builder decides that Istanbul is the preferred term, the entry for it will probably include "UF: Constantinople." Likewise, Constantinople's entry will say "U: Istanbul." This can also help with terminology change sometimes: an entry for "Automobile" might contain "UF: Horseless carriage."
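
In code, "use" and "use for" amount to a lookup table and its inverse. The sketch below is hypothetical (the names `USE_FOR`, `USE`, and `preferred_term` are mine), and it assumes matching should ignore capitalization, which is my assumption rather than a rule of thesaurus-building.

```python
# Hypothetical UF ("use for") entries, keyed by preferred term.
USE_FOR = {
    "Istanbul": ["Constantinople"],
    "Automobile": ["Horseless carriage"],
}

# Invert UF to get the U ("use") pointers, keyed case-insensitively.
USE = {
    variant.casefold(): preferred
    for preferred, variants in USE_FOR.items()
    for variant in variants
}


def preferred_term(term):
    """Follow a U pointer if one exists; otherwise return the term unchanged."""
    return USE.get(term.casefold(), term)


print(preferred_term("Constantinople"))
print(preferred_term("horseless carriage"))
```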

As for "bat," controlled vocabulary terms often have "scope notes" that help to disambiguate homonyms and explain the intended granularity for the term. A scope note would make clear that "bat" for purposes of this vocabulary means the thing you smash a homer over the left-field fence with.

The last relationship between terms that thesauri include is the "related term," which is exactly as vague as it sounds. In an entry for "bat" you might see "RT: Baseball." These have to be used sparingly and with care, or we risk sending you off on wild-goose chases; in some way or other, almost everything is related to almost everything else.
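
A single thesaurus entry, then, is a small record: the term, a scope note, and lists of related terms of each kind. Here is a hypothetical Python sketch of one. The `Entry` class and `render` function are invented for illustration, the scope-note wording is mine, and "SN" is the abbreviation thesauri conventionally print for a scope note.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One thesaurus entry; every relationship list may be empty."""
    term: str
    scope_note: str = ""
    use_for: list = field(default_factory=list)   # UF
    broader: list = field(default_factory=list)   # BT
    narrower: list = field(default_factory=list)  # NT
    related: list = field(default_factory=list)   # RT


def render(entry):
    """Format an entry the way a printed thesaurus might lay it out."""
    lines = [entry.term]
    if entry.scope_note:
        lines.append(f"  SN: {entry.scope_note}")
    relationships = (
        ("UF", entry.use_for),
        ("BT", entry.broader),
        ("NT", entry.narrower),
        ("RT", entry.related),
    )
    for label, values in relationships:
        if values:
            lines.append(f"  {label}: {', '.join(values)}")
    return "\n".join(lines)


# Illustrative entry only; the scope note text is made up for this example.
bat = Entry(
    term="Bat",
    scope_note="The club used to hit the ball in baseball, not the flying mammal.",
    related=["Baseball"],
)
print(render(bat))
```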

So now I have librariansplained the thesaurus, and you understand Christina's joke. The last thing I'll add is that many library journal-article databases use thesauri underneath. The user interface for them, however, is appallingly, stunningly bad in the implementations I know. Better UI ideas would be extremely welcome.

14 responses so far

  • jerry anning says:

    While you are librariansplaining things, I wonder if you could librariananswer a question I have. When you are tasked with establishing a new library, obviously you need to make sure you have a reasonable selection of material people are likely to want. To this end, is there some sort of prioritized list of books a library should make sure to have, with say Shakespeare at the top and some obscure 19th century novelist near the bottom? The idea is to avoid spending your book budget and realizing "dammit, we forgot Dickens!" or "what? no biology?"

    • Dorothea Salo says:

      Well, in a word... no.

      This is because there are different kinds of libraries with different patron audiences, and librarians are necessarily sensitive to that. What a brand-new library for a new engineering program on a university campus needs is night and day compared to a brand-new library in a small town, to point out extremes.

      There are certainly places to start. If you're building a new children's section in a public library, of course you look at awards lists like the Newbery and the Caldecott and the Coretta Scott King. The library professional literature also has scads of articles like "starting a graphic novel collection? here are some to consider." There are also reference books for books-by-genre; I happen to have coauthored one (sorry about that cover; I had NO INPUT WHATEVER), and there's also the fantastic Genreflecting series.

      Keep in mind also that new books are published all the time; how are you going to keep a must-have list up-to-date? Collection developers pay attention to reviews (there are specialized periodicals partly or entirely devoted to book reviews), book blogs, buzz on places like LibraryThing, media tie-ins, etc. etc. ad nauseam. And of course we listen to patron requests (though try not to request your own self-published book, 'k? we're wise to that one) and pay close attention to circulation stats and other measures of usage. ("Hm, our ten books on graphic design are getting a real workout lately. Maybe it's time to beef the collection up in that area.") Public libraries also pay attention to changes in local demographics. In my area the Spanish-speaking population is burgeoning, and we have a first-generation Hmong population of long standing, so the local library system collects for those communities as well.

      Um. Collection development is complicated? Last spring I taught a "topics in" course on it.

      • Nicole says:

        Oh, the cover of your book! I can't decide if it is perfect or perfectly awful, but I do enjoy your protestations. I mean, it is about *fantasy* after all... 😉

        • Dorothea Salo says:

          It is an awful cover. A wretched cover. A cover that caused me to lose faith in cover artists and book editors.

    • Christina Pikas says:

      Actually, there are "core" lists for several areas in the sciences. It would be overly simplistic to say you could just buy those and be good, of course. They're also useful for weeding; now D's going to have to explain what I mean by weeding. Areas I know of with core lists are environmental science (updated by EPA librarians) and Astro.

      In the end your collection has to answer the questions or fill the information needs its users have. The only way to approach this state is through ongoing and forever communication between the users and the librarians, and ongoing and forever selections and deselections.

      • Dorothea Salo says:

        See, I take YOU with me places because you know stuff about science libraries that I completely and utterly don't.

      • Beth Brown says:

        There's also an online recommended title list in nursing/medicine, Doody's (formerly Brandon/Hill). Mathematics and chemistry have recommended title lists as well, although they are in book form and somewhat dated. As Dorothea and Christina mentioned, most libraries build a collection by using these lists and soliciting suggestions from patrons and staff.

  • Christina Pikas says:

    Hey, I need to take you with me places so people know what I mean - I get a dumbfounded look, I turn to you, you say it in English, everybody's happy 🙂

  • Dave Lull says:

    And there are core lists that cover all areas of library collections; for example:

    Resources for College Libraries (RCL), "the premier core list for academic libraries"

    The Wilson Core Collections

  • A colleague is looking for a taxonomy of *kinds* of terminology change, to apply to linked data/identifiers. Does that make sense? And where would you start looking for something like that? Seems like authority control folks would have something to say there!

    PS-Scientopia is lovely! Great theme here, and MUCH more relaxing without ads!

  • alexander says:

    "Controlled vocabulary" to me sounds like someone mapped out all the words people will be able to use to describe data in the future and then tries to keep them in that cage. What Stack Overflow has done is allow people to tag things freely, then notice patterns of synonyms and adapt the search function to take account of the fact that someone will probably want results for both terms if they search for one.

    Doing this when planning your interface is a cart-before-the-horse anti-pattern in my mind. It's also not great to tell the user they've made an error by tagging something as "VB" if your system prefers "Visual Basic". Just accept the tag and let the search engine do the work.

    • Dorothea Salo says:

      Controlled vocabularies aren't set in stone. They do change, although in fairness, sometimes too slowly.

      The basic problem with permitting freeform tagging is exactly what Stack Overflow ran into: leaving synonymy alone means people just plain don't find everything they're looking for. Rather than letting search engines try to deal (which they tend to be poor at, incidentally), it often makes more sense (and yields more efficient searches) to use U/UF relationships behind the tagger to reduce synonymy up front, and behind the search engine to guide searches and (gently) teach searchers what the preferred terms are.
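
The arrangement in that last reply, with U/UF relationships behind both the tagger and the search engine, can be sketched in a few lines of Python. This is a hypothetical illustration, not a description of how Stack Overflow or any library database actually works. The `USE` table, the function names, the sample items, and the lowercase matching are all assumptions made for the example.

```python
# Hypothetical U pointers: non-preferred tag -> preferred tag (all lowercase).
USE = {"vb": "visual basic"}


def to_preferred(tag):
    """Lowercase a tag and follow its U pointer, if it has one."""
    tag = tag.lower()
    return USE.get(tag, tag)


def normalize_tags(tags):
    """Behind the tagger: accept whatever was typed, store preferred tags once."""
    stored = []
    for tag in tags:
        preferred = to_preferred(tag)
        if preferred not in stored:
            stored.append(preferred)
    return stored


def search(query_tag, tagged_items):
    """Behind the search box: search on the preferred tag and mention the swap."""
    preferred = to_preferred(query_tag)
    hint = ""
    if preferred != query_tag.lower():
        hint = f'Showing results for "{preferred}"'
    hits = [title for title, tags in tagged_items if preferred in tags]
    return hits, hint


# Made-up items, tagged the way two different people might tag them.
items = [
    ("Looping over arrays", normalize_tags(["VB", "arrays"])),
    ("String formatting", normalize_tags(["Visual Basic"])),
]
print(search("VB", items))
```

Nobody is told they made an error: the tagger quietly stores the preferred term, and the search hint is where the gentle teaching happens.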