Archive for the 'Uncategorized' category

Librariansplaining: The controlled vocabulary

Aug 05 2010 Published under Uncategorized

Fellow Scientopian Christina Pikas posted an examination of Stack Overflow's motion toward a controlled tagging vocabulary. Toward the end, she made me grin:

Ok, one of my ongoing jokes is how CS keeps reinventing LIS (well indeed they’ve taken over the term “information science” in some places) – so now Stack Overflow has reinvented taxonomy (not quite a thesaurus though, right, because no BT or NT just UF and U, lol)

A lot of librarians, me not least, grumble "we told y'all so" when we see computer science reinventing our wheels. What this means, of course, is that librarians haven't done nearly a good enough job of explaining our wheels.

This is what Book of Trogool's "Jargon" category is all about. I mean to rename it to "Librariansplaining" (I'm sure Zuska or Janet will explain the coinage, if it isn't obvious already) as soon as I can sort out how to do that without borking category links.

And now I'm going to librariansplain about controlled vocabularies, and explain Christina's in-joke. It may help you to read some of the earlier posts in this category first.

In those posts, I talked about how librarians divide up the world of knowledge into teensy-tiny slivers of "aboutness" in order to help lead you from one item of interest to another. One of the pieces of dividing up knowledge is naming the slivers. When you start doing that, as Christina noted, you run into some human-language problems really quickly:

  • Synonymy. Istanbul or Constantinople? It's our business, as well as the Turks'.
  • Homonymy. I say "bat." Do you say "Chiroptera" or "baseball"? And if librarians decide to use the word for the baseball apparatus, what should we do so that the Chiroptera-fanciers can find stuff they want?
  • Terminology change. Nobody calls it a "horseless carriage" any more. To make matters worse, the first name something new gets is often not the name that sticks. Social changes also loom large here; some of the cruft that can accumulate in a naming system is kyriarchical cruft.
  • Granularity. Knowledge is infinitely divisible. Naming systems have to decide at what level separate names are warranted. It can also help to indicate relationships up and down the granularity chain; for example, one could call "weblogs" a subcategory of "social software." Or not.

So when librarians "control a vocabulary," we come up with a naming system that avoids the above pitfalls as much as we can manage.

Various types of controlled vocabularies exist; I don't propose to describe them all here. Instead, I'll describe the type that Christina was referring to: the thesaurus. (No, not the synonym dictionary. This is different. Hang with me while I explain.)

Thesauri cope with granularity by establishing "broader-term" and "narrower-term" relationships between terms. So in an entry for "Social software" you might see "NT: weblog, wiki, social-networking service." Likewise, in a "Weblog" entry you may well see "BT: social software." This doesn't absolve the vocabulary-builder of the responsibility to choose the granularity of terms wisely, but it does help.

Homonymy and synonymy are often dealt with via "use" and "use for" relationships. If a vocabulary-builder decides that Istanbul is the preferred term, the entry for it will probably include "UF: Constantinople." Likewise, Constantinople's entry will say "U: Istanbul." This can also help with terminology change sometimes: an entry for "Automobile" might contain "UF: Horseless carriage."

As for "bat," controlled vocabulary terms often have "scope notes" that help to disambiguate homonyms and explain the intended granularity for the term. A scope note would make clear that "bat" for purposes of this vocabulary means the thing you smash a homer over the left-field fence with.

The last relationship between terms that thesauri include is the "related term," which is exactly as vague as it sounds. In an entry for "bat" you might see "RT: Baseball." These have to be used sparingly and with care, or we risk sending you off on wild-goose chases; in some way or other, almost everything is related to almost everything else.
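For the programmers in the audience, the whole apparatus of BT, NT, U, UF, RT, and scope notes fits comfortably in a small data structure. Here is a toy sketch in Python; the terms and relationships are invented for illustration, and a real thesaurus is vastly richer:

```python
# A toy thesaurus. Each preferred term gets an entry with its broader
# terms (BT), narrower terms (NT), "use for" variants (UF), related
# terms (RT), and an optional scope note. Non-preferred terms carry
# only a "use" (U) pointer to the preferred one.

THESAURUS = {
    "social software": {
        "NT": ["weblog", "wiki", "social-networking service"],
    },
    "weblog": {
        "BT": ["social software"],
    },
    "bat": {
        "scope_note": "The baseball apparatus, not the flying mammal.",
        "RT": ["baseball"],
    },
    "istanbul": {
        "UF": ["constantinople"],
    },
    # Non-preferred terms point at the preferred term.
    "constantinople": {"U": "istanbul"},
}

def preferred(term: str) -> str:
    """Follow a U ('use') pointer to the preferred term, if any."""
    entry = THESAURUS.get(term.lower(), {})
    return entry.get("U", term.lower())

print(preferred("Constantinople"))  # istanbul
```

The U pointer does the synonym work: it quietly turns a searcher's "Constantinople" into the vocabulary's preferred "Istanbul" before any matching happens.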

So now I have librariansplained the thesaurus, and you understand Christina's joke. The last thing I'll add is that many library journal-article databases use thesauri underneath. The user-interface for them, however, is appallingly, stunningly bad in the implementations I know. Better UI ideas would be extremely welcome.

14 responses so far

Authority control, then and now

Dec 18 2009 Published under Uncategorized

Since the end of the year is a fairly quiet time for my particular professional niche, I've taken the opportunity to do some basic name authority control on author name-strings in the repository.

Some basic what on what, now? Welcome back to my series on library information management and jargon.

The problem is simple to understand. Consider me as an author. I took my husband's surname upon marriage; fortunately, I hadn't published anything previously, but I might have done—and if I had, how would you go about finding everything I've written, if it was published under two different names? "Dorothea" is a fairly distinctive given name, especially in my age cohort, but I do share it with other creators.

Now consider creators whose names are not written in Roman characters. The many and varied romanizations of the composer Tchaikovsky may give pause, though my personal favorite example is a certain Libyan leader who wrote a book or two. (Click over and then hit the plus beside "400's: Alternate Name Forms.")

Libraries confronted this problem when the search technology of choice was the card catalogue. The outline of a solution emerges: to avoid wasteful duplication of cards, all the cards representing titles by a given author should be in one place under one name, but it should also be possible to pop in a single card for each additional name variant so that searchers know which variant is hiding the good stuff. ("Chaikowsky, Peter Ilich: see Tchaikovski, Piotr Ilyich, 1840-1893.")

This means choosing a preferred name variant, of course. Ideally, we'd like this to be consistent across libraries, so that the devotee of Russian music who learns the preferred variant in her home library will easily find what she needs at any other library.

There are additional wrinkles as well: it does happen that different authors wind up with the same name, and for library purposes, that's no good. My husband David, for example, shares his name with a book-writing swimming coach. Libraries chose to use birth years—and, only if necessary, death years—to disambiguate.

Aha, you say. This is why not all author names in library catalogues have attached dates. This is why not all authors with listed birth dates have death dates, even when they'd have to be older than Methuselah to be living still. Yes, this is why. Dates in author headings started strictly as a disambiguation measure; the swim coach didn't have his birth year beside his name until my husband turned up and wrote a book. Of late, there have been raucous arguments among cataloguers in libraryland about adding death dates as a matter of course.
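In data terms, much of what I've just described boils down to a lookup table from name-string to authorized heading. A toy sketch in Python, using the Tchaikovsky romanizations as illustration (the exact variant spellings below are invented):

```python
# A toy name-authority file: every known variant of a name-string maps
# to one preferred heading, with dates attached to disambiguate authors
# who share a name. Variant spellings invented for illustration.

AUTHORITY = {
    "Chaikowsky, Peter Ilich":   "Tchaikovski, Piotr Ilyich, 1840-1893",
    "Tschaikowsky, Pjotr":       "Tchaikovski, Piotr Ilyich, 1840-1893",
    "Tchaikovsky, Peter Ilyich": "Tchaikovski, Piotr Ilyich, 1840-1893",
}

def heading_for(name_string: str) -> str:
    """Resolve a name-string to its authorized heading; pass it through
    untouched if no authority record exists (as with most article authors)."""
    return AUTHORITY.get(name_string, name_string)

print(heading_for("Chaikowsky, Peter Ilich"))
# Tchaikovski, Piotr Ilyich, 1840-1893
```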

All of this activity—choosing preferred name variants such that each name listing remains unique, listing other name variants with the preferred, organizing by-author displays accordingly, coping with name changes—is called "name authority control." (It has an analogue for subject work, sensibly enough called "subject authority control." This verges on the topic of controlled vocabularies, which is definitely one for another post. Or six.) For catalogue cards, this solution is remarkably elegant and entirely functional. For computer-based record management—well.

Relational-database experts are howling right now, at the idea that a primary key—what's used to identify a particular row of information, a particular item, in a database—would ever change. The whole point of a primary key is its immutability! Ask for record number 91346342, always get the same record. You never, ever, ever change that record ID. Ever. Really, not ever. If a particle of information can change, it shouldn't be used as a primary key!

Linked-data experts are howling as well: why don't all these people have URIs? (If you remember your analogies from the SAT, database:primary key::RDF:URI. Roughly, anyway.) Well, they do, now, thanks to VIAF. Here's my VIAF URI (no, I have no idea why my birth year is included in my authority string, as my name by itself is unique in authority data; ask a cataloguer) to look at. Feel free to hunt for your own URI.
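For concreteness, here is the relational-database norm those experts are howling about, sketched with Python's standard sqlite3 module (names invented): an immutable surrogate key identifies the author, and the name-string is merely a mutable attribute.

```python
# The relational-database view of authority control: the key never
# changes, no matter what the name-string does. Names are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO authors (id, name) VALUES (1, 'Doe, Jane')")

# Jane marries and changes her name; record 1 is still record 1, and
# everything that references id 1 still works.
conn.execute("UPDATE authors SET name = 'Roe, Jane' WHERE id = 1")

row = conn.execute("SELECT name FROM authors WHERE id = 1").fetchone()
print(row[0])  # Roe, Jane
```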

To some librarians, all this business of immutable identifiers may sound like specious wrangling, but it's not: it's actually a major disjunction among cataloguing practice, the databases underlying ILSes, and the perennially-emerging world of linked-data mashups via RDF. Inexpert programmer that I am, the idea of programming around library methods of authority control makes my head hurt. It leads to real problems making online catalogues work well (never mind library systems that aren't tied into authority control, such as digital-library platforms and institutional repositories), and making library data play nicely with other people's data. When gearhead librarians and other technologists say "library data is siloed," this is exactly the sort of thing they mean.

You may, particularly if you are a hard scientist, have noticed another hole in this system: you don't get into it unless you have written a book. (Exceptions, yes, for editors and composers and book illustrators and whatnot. However.) I, for example, had two or three articles and book chapters come out before co-authoring a book published in 2008. I didn't have an authority record until the book was catalogued. If all you've published are articles, you don't have an authority record, sorry.

This is becoming a serious problem! If it were just people like me struggling with it, that wouldn't signify; as a librarian, I'm supposed to struggle with this sort of thing. I learned hotshot DIALOG-searching tricks in library school to get around article databases' lack of name authority control, for instance. Right now, I've built up a strategy for finding physicists' and engineers' first names that mostly works, though I do wish whatever weird graduate-school midnight hazing ceremony that deprives these worthy people of their given names in favor of their initials would wither away and die. (I am joking. Mostly. This phenomenon, though of course it isn't the result of hazing, can be maddeningly difficult to rectify, especially when the author in question is a graduate student who either doesn't graduate or doesn't go on to an academic career.)

No, the real problem concerns the changing nature of performance measurement in academia, mostly in the sciences to date. As journal impact factors wane in importance (not nearly fast enough for me!), the importance of measuring the impact of individual articles and other publications via citations and download counts rises. How are we to measure this anything like correctly for a given author if we can't reliably match articles to authors?

In an article published earlier this year, I wrote that there was a ferment of activity around the question of author authority, and what would come of it all was far from clear. I'm happy to say that clarity is emerging, in the form of ORCID: the Open Researcher and Contributor ID initiative. This effort looks to me to have critical mass and brainpower to make a difference: publishers, libraries, technologists, and research funders are all involved.

In the meantime, I plod through the repo's author listings, making what minimal order I may, very desirous of a better solution.

8 responses so far

Classification and a bit of subject analysis

Oct 26 2009 Published under Uncategorized

It's been a while since I did anything on my series about library ways of knowing. If you'd like to refresh your memory, the earlier posts in this series are worth revisiting.

Today I'll finish my discussion of classification, and distinguish it from subject analysis, since that distinction often seems to confuse, especially in our digital age.

So if we'll recall, the goal we set for ourselves was to collocate physical books on shelves in such fashion that their arrangement would be useful to information-seekers. With most non-fiction, that means collocation by subject, by what the books are about.

(There are lengthy philosophical discussions of "aboutness" in the information science literature. I recommend avoiding them with all your strength. They make my eyes bleed.)

To make this work, we have to map knowledge-space onto physical space: divide up human knowledge into convenient slots to assign books to. This is, you might say, a tall order: an ontology of infinite domain, but where each item can only fit in one place.

In the States, most libraries use one of two such maps: the Dewey Decimal System or the Library of Congress Classification. About the kindest thing one can say for Dewey Decimal is that it was a product of its peculiar time; for today's purposes, it is heavily overnumbered in religion, for example, and undernumbered in science. Perhaps worse, its sense of the world is not exactly immediately intuitive to the modern eye: why the long separation of geography from the so-called "social sciences," of which psychology is apparently not one?

This is one danger of any would-be universal classification. Our sense of the world and its knowledge changes over time, sometimes quite a lot and quite suddenly. If our ontology doesn't keep up, it serves its purposes less and less well. How easy is it, really, to find the right shelf in a library of any size organized by Dewey Decimal? Considerations such as these no doubt informed the shift of one library (and later others) to the BISAC codes typically found in large bookstores.

Another danger of the universal classification is that its specificity is of necessity somewhat limited. Many medical libraries, for example, ditch Library of Congress Classification because it just doesn't drill down far enough into medical minutiae for their needs. The NLM Classification fills the gap.

With physical books, we cannot escape the constraint that each book must go in one and only one place on the shelf. Once we're away from the physical item, that constraint disappears. The card catalogue was the first desperately clever escape from the tyranny of the physical item: in a card catalogue, the same book could be "shelved" by author, title, and one or more (usually three to five, to avoid overproliferation of cards) subjects assigned to it by the cataloguer.

This meant the addition of a subject-heading system to the classification vocabulary. You can't just add more classification numbers to the physical item; you then imply that it goes in more than one place! This is the difference between Library of Congress Classification and Library of Congress Subject Headings. Under most circumstances, the LCC number assigned to a book will correspond closely in meaning to the first LCSH assigned in the book's catalogue record. They are still distinct systems, however! Don't confuse them. Librarians chuckle behind their hands.
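A catalogue record, reduced to a toy Python dict, makes the split visible; the call number and headings below are invented for illustration:

```python
# One classification number (LCC), because the book sits in exactly one
# place on the shelf; several subject headings (LCSH), because cards and
# records can be filed in as many places as the cataloguer thinks useful.
# Call number and headings invented for illustration.

record = {
    "title": "Blogging for Scientists",
    "call_number": "ZA4226 .B56 2009",
    "subject_headings": [
        "Weblogs",
        "Communication in science",
        "Social media",
    ],
}

print(len(record["subject_headings"]))  # 3
```

One call number, many headings: that asymmetry is the whole difference between classification and subject analysis.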

Of course, digital items don't have to live in just one space. Classification is therefore slowly giving way to subject analysis and similar ways of relating items to each other as digital libraries develop.

And that, in a remarkably simplified nutshell, is how books are arranged on shelves in libraries. It doesn't happen by magic!

No responses yet


Sep 09 2009 Published under Uncategorized

Now that we've looked at how back-of-book indexes endeavor to organize and present the information found in a book, we can consider organizing books themselves. It's quite astonishing how many people go to libraries and bookstores without ever stopping to think about how books end up on particular shelves in particular areas. There is no magic Book Placement Fairy!

Let's consider the problems we're trying to solve for a moment. A library has a lot of books, on which ordinary inventory-control processes must operate. So librarians as well as patrons must be able to locate the specific book they're after based on information they have about the book, and once they have the book in hand they must be able to reassure themselves they've found the right book.

What information should librarians capture about the book in order to make this possible? What should they put on the book, and where should they put the book, to make it easier? (Before you answer, consider a book with multiple editions, or purchased in multiple copies.)

The next problem we'd like to take a stab at is enabling patrons to discover useful or interesting books based on the books' physical location. This hasn't always been a desideratum: consider books chained to lecterns, closed stacks, and the more recent phenomenon of offsite book storage. Still, just about any library with open stacks wants the physical location of a book to be a Hansel-and-Gretel breadcrumb trail, leading readers almost invisibly to related materials.

So, just to throw one oft-mentioned possibility out right away, organizing books by cover color is probably not the way to go here… It's also worth mentioning that physicality sometimes interrupts the perfect vision of library classification: "oversize" storage is necessary for books that don't fit on the regular shelves alongside what would otherwise be related materials.

We do have one important constraint to consider: a book is a physical item that can only be shelved in one place. (Multiple copies of a book are just about always shelved together.)

What librarians do to identify books and put related books near each other is called "classification," and as I hope you've guessed by now, it usually involves determining what the book is "about" and what other books are "about" the same or similar things. The phenomenon of bringing together related information packages is called "collocation" in librarian-speak, and is an important principle underlying classification.

(There are exceptions to "aboutness" as the underlying criterion for classification. For example, many public libraries shelve fiction by genre and author rather than "aboutness," and there are longstanding arguments about how best to shelve biographies and memoirs.)

Classification is not an exact science; for one thing, it tends to be contextual. The same book may be in two very different places in different libraries, depending on the contours of each library's collection and the predilections of its patron base. Still, librarianship has developed several classification schemes to assist with this problem… and I'll be discussing some of them in my next post on the subject.

In the meantime… go to your local library and scrutinize the shelves for a bit.

No responses yet

The humble index

Aug 25 2009 Published under Uncategorized

I'd like to start our tour of book and library information-management techniques with a glance at the humble back-of-book index. I started the USDA's excellent indexing course back in the day, and while it became clear fairly quickly that I do not have the chops to be a good indexer and so I never finished the course, I surely learned to respect those who do have indexing chops. It's not an easy job.

Go find a book with an index and flip through it. Seriously, go ahead. I'll wait. Just bask in the lovely indentedness and order of it all.

Now answer me a question: Should Google be calling that huge mass of crawled web data it computes upon an index?

Arguably, it shouldn't, though this is absolutely a lost battle; the word "index" is polysemous and always will be. What Google has is more along the lines of a concordance of the web. What's a concordance, you ask? A list of words in a given corpus of text, along with pointers to where those words are used in the corpus. Way back in the day, compiling concordances to important literature (e.g. religious texts) was considered worthy scholarly work. Today, of course, feeding a text into a computer can yield a concordance in seconds—I'm no great shakes as a programmer, but even I could hack up some concordance software if I had to.
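To make that concrete, here is roughly the concordance software I have in mind, in a few lines of Python. Word offsets stand in for the page-and-line pointers a print concordance would use:

```python
# A bare-bones concordance: every word in a corpus, with pointers
# (here, word offsets) to every place it occurs. A real one would
# worry about punctuation, stopwords, and much else.
import re
from collections import defaultdict

def concordance(text: str) -> dict:
    occurrences = defaultdict(list)
    for position, word in enumerate(re.findall(r"[a-z']+", text.lower())):
        occurrences[word].append(position)
    return dict(occurrences)

corpus = "In the beginning was the word, and the word was with the word."
print(concordance(corpus)["word"])  # [5, 8, 12]
```

Note that this captures every occurrence of every word, which is precisely what distinguishes a concordance from an index.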

Google's index is a bit more than a straight-up concordance: they do stemming and some n-gram analysis and other fancy-pants tricks. But it is still qualitatively different from a back-of-book index. How? I'll adduce three major differences: human intervention, terminological insularity, and intentional grouping.

There is a standard documenting what an index is for and how to create one. I'm not paying over $250 to own it, but I'll happily give you the gist.

An indexer presented with a book reads it at least twice, with concentrated attention. She is looking for concepts that the book treats usefully and/or in some depth, because an index containing every passing mention of everything is usually useless to someone asking "does this book have useful, original information on topic X?"

(I did say "usually." Sometimes a topic is so terribly remote or abstruse that even the slightest mention is useful. That's when a concordance can be superior to an index. Google Books is a godsend to lovers of minutiae.)

Please note that I said concepts, not "words" or even "phrases." A recurring problem in information management is that human language is truly annoying about using different words for the same thing, in various sorts of ways that this post is already too long to discuss in depth. Suffice to say that part of the indexer's job is to tease out concepts in the text that aren't necessarily labeled consistently or even labeled at all. A text on web design may never actually use the word "usability," for example, but that doesn't mean it has nothing to say about the subject! A good indexer will work that out.

So how does an indexer label the concepts she finds? Well, ideally, the text has done that for her; that's why an index is more insular than Google, which makes considerable use of other people's labels for web pages insofar as those are discoverable through links. (That's what Googlebombing is all about, of course.) The indexer is not slavishly bound to the text's language, however. She is allowed to take into account the text's readers, and (what she believes to be) their language use.

An indexer will not lightly discard the text's usage. What she will do is use "See" entries to connect likely reader usage to the text's usage. If the aforementioned web-design text casually throws in "HCI" without ever expanding it (shame on the editor! but it does happen), a smart indexer will throw in an entry "Human-computer interaction. See HCI." Remember this trick. We will see it in other forms later.

A See entry is not the same as a See also entry. See entries are intended for more-or-less synonymous terms. Rather than wastefully repeat the entire litany of page numbers for every synonym of a given term, pick the likeliest term (probably the text's most-often-used term, but again, the indexer has some discretion) and point the other synonyms to it. See also entries are for related terms, other information that in the indexer's judgment a reader might be interested in.
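The distinction is easy to make concrete in code. A toy sketch in Python, with entries and page numbers invented for illustration:

```python
# A See entry carries no page numbers of its own; it exists only to
# redirect. A See also entry supplements a full entry with pointers
# to related ones. Entries and page numbers invented.

index = {
    "HCI": {"pages": [12, 47, 113], "see_also": ["usability"]},
    "Human-computer interaction": {"see": "HCI"},
    "usability": {"pages": [47, 90]},
}

def render(term: str) -> str:
    entry = index[term]
    if "see" in entry:
        return f"{term}. See {entry['see']}"
    line = f"{term}, " + ", ".join(map(str, entry["pages"]))
    if "see_also" in entry:
        line += ". See also " + ", ".join(entry["see_also"])
    return line

print(render("Human-computer interaction"))  # Human-computer interaction. See HCI
print(render("HCI"))  # HCI, 12, 47, 113. See also usability
```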

See also entries are another example of the grouping function of an index, alongside the entire idea of bringing mentions of the same concept that have been scattered throughout the text together in a single entry. Google does not do this save by haphazard accident. A few other search engines try (have a look at Clusty), but the results typically don't make entire sense—and why should they? They're using algorithms to do a human's job!

Purely mechanical questions such as page count enter into index compilation as well; publishers reserve a certain number of pages for the index (or in the hurly-burly of typesetting, a certain number of pages become available), and the index must be chopped to fit. You can imagine, I'm sure, that it's much harder to do a short index than a longer one!

Indexing electronic books introduces user-interface and workflow questions. The print book has the immensely convenient "page" construct to use as a pointer. The electronic book may have pages—or it may scroll, or the page boundaries may change according to font size, or… you see the UI problem, I trust. It's not insoluble, but it's annoying. The workflow problem is simple: how (and when in the production process) does the poor indexer mark the places a given entry should point to?

When I was doing ebooks back in the day, these problems hadn't been solved yet. I worry sometimes that if they remain unsolved, the noble art of book indexing will wither and die—and the search engine, as I hope you now understand, is not an entire replacement.

Go back and flip through that book index again. Appreciate it a little more? Excellent.

8 responses so far

XML and cows

Jul 23 2009 Published under Uncategorized

Because I've seen it quoted, misquoted, and usually not attributed at all… "Converting PDF to XML is a bit like converting hamburgers into cows." That is the quote as I know it. It comes from revered XML developer Michael Kay on the xml-dev mailing list in July 2006.

It's possible Kay got this from somewhere else, but I've never seen an earlier attribution. (Comments are open if I'm wrong.)

I hear all sorts of chest-beating about attribution in data circles, often for good and sufficient reason. I think we can stand to get our quotes and their authors right.

2 responses so far

What is e-research?

Jul 17 2009 Published under Uncategorized

That would be the question, wouldn't it. Unfortunately, such fundamental definitions are never simple to create, and even less simple to agree upon. A little history may help explain how we got into this parlous uncertain state, but it may not get us out of it.

The short version of the history (which all and sundry may feel free to correct in the comments) is that the Anglophone world had a terminology breakdown right from the start: what the English called "e-science" the Americans (with our customary tin ear) dubbed "cyberinfrastructure." Then the humanities reared back on their hind legs brandishing their claws like heraldic gryphons and demanded to know why they weren't included in the discussion, which led to the umbrella term "e-research" that I prefer to use.

So we have a term. Several terms, in fact. Do we have definitions for them? Well… no, not particularly, to be perfectly honest. Being a hardheaded sort, I prefer to define by praxis. What is it that e-research does? What do e-researchers do that other researchers might not?

One thing is "use grid computing to tackle otherwise intractable problems." Grid computing is the modern version of the Cray supercomputer—more computing power than you can possibly imagine. Only it's done not with one gigantic machine, but with hosts of ordinary machines yoked together by specialized software. Think about a Google server-farm, and you aren't too far wrong.

Another thing is "generate and analyze huge piles of digital data." Instrument science, all sorts of imaging, text-mining, survey data, observational data—it's all piling up. Back in the day, there wasn't much that could be done with raw (or even cooked) data other than boiling it into a graph or chart or table for a published journal article. The advent of computers has changed that irrevocably, and data show signs of becoming just as much a first-class citizen in the research polis as the journal article or the monograph.

This, of course, creates the problem of holding onto digital data—sometimes in shocking quantity—and keeping it useful and accessible. Again, our practices haven't caught up with us on this. Some disciplines have well-established lab-notebook cultures. Many don't. Almost no disciplines have established digital-data standards and practices; the quantitative social sciences are a long way in the lead thanks to enlightened research-data centers such as ICPSR, but other fields are gamely working to catch up.

"Data curation," as it is often called, is my major professional interest in the e-research firmament, so you can expect to see it discussed often here. I am partial to Melissa Cragin's definition: "the active and ongoing management of (research) data through its lifecycle of interest and usefulness to scholarship, science, and education." I hope to unpack this definition in a future post.

A last behavior thought to set e-researchers apart is computer-enabled collaboration in various forms, from the breaking-down of institutional barriers to the spread of inter- and multi-disciplinary research teams. Social networking is often mentioned in this context, though sometimes with a bit of a sneer. Even the humanities, where scholars tend to self-define as solitary and esoteric, are beginning to find that the life of the mind can be usefully shared.

Does what I've said accord with your impression of e-research? Comments are open!

9 responses so far
