Archive for: September, 2009

IRs, "data," and incentive

Sep 10 2009 Published under Tactics

Many of my readers will already have seen the Nature special issue on data, data curation, and data sharing. If you haven't, go now and read; it's impossible to overestimate the importance of this issue turning up in such a widely-read venue.

I read the opening of "Data sharing: Empty archives" with a certain amount of bemusement, as one who has been running institutional repositories in libraries for four years. I think Bryn Nelson has confusingly conflated different notions of "data" in his discussion of the University of Rochester's IR.

By the definition Nelson appears to be thinking about, anything digital is automatically data. Thus "dissertations, preprints, working papers, photographs, music scores…" This is not the definition I use for this weblog, nor is it (I believe I am qualified to say by now) what most academic libraries starting IRs had in mind, either.

For purposes of this weblog, the word "data" means the stuff coming out of the research process that isn't prose aimed at a human audience. That's loosey-goosey (most definitions of "data" are), but you get the general idea. A dissertation is not data. Neither is a preprint or a working paper. A photograph might be. A music score probably isn't (how many music scores are research products?). Research data aren't research documents.

(Historians of science, please close your eyes for a bit; you're different. I know research documents are data to you. That's still not what I mean by the term.)

I can't and don't speak for the University of Rochester; I don't know what their IR's collection-development policy is, nor what was going through their minds when the IR was on the planning table. I do know with fair certainty that for most IRs, the problem of data (in this weblog's definition) wasn't so much as a gleam in anybody's eye at the outset. Indeed, for many IRs it still isn't. Libraries started IRs hoping for open access to the journal literature and better access to and preservation of digital gray literature (dissertations, working papers, technical reports, et cetera).

Perhaps Rochester was an exception; again, I don't know. But attributing the emptiness of IRs to a problem with data-sharing makes my head hurt. It doesn't square at all with my lived experience of IRs.

Now, the emptiness problem meant that most if not all IRs expanded their collection and service scope, simply out of necessity. For an excellent, nuanced discussion of this phenomenon, read a Mellon grant report by Carole Palmer et al. Do datasets fall within IRs' purview now? Well… maybe. Depends whom you ask.

I don't want to wander off into the over-technical weeds here, so I'll limit myself to remarking that the technology underlying most IRs (both hosted and roll-your-own) is extraordinarily poorly suited to much research data, having been optimized for documents. This is a serious stumbling block for IRs wanting to expand into data curation.

That problem aside, however, the important question of incentive remains. Even if we accept my division between research documents and research data, who is to say that institution-based data collection will work any better than institution-based document collection? Will data archives remain as empty as IRs?

I think not. Perhaps over-optimistically, but even so. I think not, and here's why.

When IR managers went to faculty, hat in hand, asking for preprints and postprints, they were charging Quixotesquely against a gigantic windmill: the existing scholarly-communication system, which as far as most faculty in most disciplines are concerned works just fine. Filling IRs with the peer-reviewed literature they were established to collect meant changing minds, hearts, and (most crucially) workflows. In practice, it was an impossible dream, especially as IR technology bears more than a little resemblance to the spavined nag Rosinante, and IR managers had little to wield by way of spear or shield.

Right. Now that I have run that metaphor into the ground and stomped its gravesite flat… why are data repositories different?

First, for most disciplines, there simply is no analogue to the existing scholarly-communication system where data are concerned. For pity's sake, we haven't even worked out how to cite data yet! Where researcher workflows and expectations are not yet formed, opportunity awaits.

Second, data repositories and their managers can offer real, meaningful help to researchers, in ways IRs either didn't or couldn't. Publishers are perceived, rightly or wrongly, as providing meaningful service to research; IRs have not achieved that perception, so they have made few meaningful inroads among researchers. Data repositories, data librarians, and data technicians can solve real-world problems that many researchers are already feeling, and many more are likely to feel soon.

Third, the effects of widely-available data are amassing a fairly impressive track record even at this early date. Genomics, economics, literary text-mining, linguistics, name your disciplinary poison: digital data enable answers to more and different questions, faster. IRs? Not so much with the palpable effects (outside ETDs), I'm sorry to say.

Last, the regulatory framework around data seems likely to solidify a good deal faster than the framework around open access to publications. Part of this, of course, comes from not needing to push against entrenched interests, the way the NIH Public Access Policy had to fend off the large-publisher lobby. I understand funder hesitation surrounding the dearth of data standards and the dearth of sustainable repositories, but I'm willing to hazard that the right hands will eventually shake each other for all that to work itself out.

So on balance, I'm hopeful. Nothing is certain; sometimes the many ways to mess this up keep me awake at night. I see motion on so many levels, though—from individual researchers all the way up to huge government funders—that I think data curation is very nearly a foregone conclusion.

Ways and means… well, I have to have some uncertainty around to keep this weblog active.



Sep 09 2009 Published under Uncategorized

Now that we've looked at how back-of-book indexes endeavor to organize and present the information found in a book, we can consider organizing books themselves. It's quite astonishing, how many people go to libraries and bookstores who never seem to stop to think about how books end up on particular shelves in particular areas. There is no magic Book Placement Fairy!

Let's consider the problems we're trying to solve for a moment. A library has a lot of books, on which ordinary inventory-control processes must operate. So librarians as well as patrons must be able to locate the specific book they're after based on information they have about the book, and once they have the book in hand they must be able to reassure themselves they've found the right book.

What information should librarians capture about the book in order to make this possible? What should they put on the book, and where should they put the book, to make it easier? (Before you answer, consider a book with multiple editions, or purchased in multiple copies.)

The next problem we'd like to take a stab at is enabling patrons to discover useful or interesting books based on the books' physical location. This hasn't always been a desideratum: consider books chained to lecterns, closed stacks, and the more recent phenomenon of offsite book storage. Still, just about any library with open stacks wants the physical location of a book to be a Hansel-and-Gretel breadcrumb trail, leading readers almost invisibly to related materials.

So, just to throw one oft-mentioned possibility out right away, organizing books by cover color is probably not the way to go here… It's also worth mentioning that physicality sometimes interrupts the perfect vision of library classification: "oversize" storage is necessary for books that don't fit on the regular shelves alongside what would otherwise be related materials.

We do have one important constraint to consider: a book is a physical item that can only be shelved in one place. (Multiple copies of a book are just about always shelved together.)

What librarians do to identify books and put related books near each other is called "classification," and as I hope you've guessed by now, it usually involves determining what the book is "about" and what other books are "about" the same or similar things. The phenomenon of bringing together related information packages is called "collocation" in librarian-speak, and is an important principle underlying classification.
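For the code-minded, here is a minimal sketch of collocation. Everything in it is an assumption made for illustration: the book records, the one-word "subject" labels, and the `shelf_key` function are hypothetical, and no real classification scheme is this simple. The idea is only that each physical copy gets exactly one place on the shelf, and that sorting on "aboutness" first puts related books next to each other.

```python
from itertools import groupby

# Hypothetical inventory: one record per physical copy.
books = [
    {"title": "Garden Birds", "author": "Lee", "subject": "ornithology", "copy": 1},
    {"title": "Bread Basics", "author": "Ames", "subject": "baking", "copy": 1},
    {"title": "Garden Birds", "author": "Lee", "subject": "ornithology", "copy": 2},
    {"title": "Owls at Night", "author": "Cho", "subject": "ornithology", "copy": 1},
    {"title": "Sourdough", "author": "Diaz", "subject": "baking", "copy": 1},
]


def shelf_key(book):
    """Sort by subject first, so books "about" the same thing sit together."""
    return (book["subject"], book["author"], book["title"], book["copy"])


# Each copy lands in exactly one position on the shelf.
shelf = sorted(books, key=shelf_key)
shelf_titles = [book["title"] for book in shelf]

# Group the sorted shelf by subject to see which titles end up as neighbors.
neighbors = {
    subject: [book["title"] for book in group]
    for subject, group in groupby(shelf, key=lambda book: book["subject"])
}
```

Note that both copies of "Garden Birds" end up side by side, and a browser who finds one bird book finds the others without asking anyone.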

(There are exceptions to "aboutness" as the underlying criterion for classification. For example, many public libraries shelve fiction by genre and author rather than "aboutness," and there are longstanding arguments about how best to shelve biographies and memoirs.)

Classification is not an exact science; for one thing, it tends to be contextual. The same book may be in two very different places in different libraries, depending on the contours of each library's collection and the predilections of its patron base. Still, librarianship has developed several classification schemes to assist with this problem… and I'll be discussing some of them in my next post on the subject.

In the meantime… go to your local library and scrutinize the shelves for a bit.


When is text in a PDF not text?

Sep 09 2009 Published under Miscellanea

I see this confusion so often it seems worth addressing.

If you scan a page of text, what you have is a picture. A computer sees it not as letters, numbers, and punctuation—but as pixels, bits of light and shade and color, just like the pixels in your favorite family photo on Flickr.

You can't search for, extract, highlight, or cut-and-paste such "text." It doesn't matter whether you embed the picture in a PDF; you still can't search it. Ceci n'est pas un texte!

Compare this to creating a PDF from a word-processing or page-layout document. The computer already thinks of the text in these documents as text, so it can embed the text in the PDF as text. The text is thus searchable, extractable, and all that good stuff. (Within limits. PDF is horrible for text-mining, for reasons I may decide to discuss sometime.)

To make the text in a scanned picture searchable, you must use Optical Character Recognition (OCR) technology on the picture. OCR tools look at the picture and try to figure out what letters, numbers, and punctuation it contains. Once you've OCRed the picture, you may embed the text in the PDF along with the picture, whereupon you may be able to search and extract it.

But no OCR, no text, as far as computers are concerned.
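If a toy example helps, here is a rough sketch of the difference. It is not how real scanners or OCR engines work; the 3x5 "font" in `GLYPHS`, the `render` function standing in for a scanner, and the exact-match `toy_ocr` are all hypothetical simplifications. The point is that the "scan" contains only pixels, so searching it for a word finds nothing until something maps pixel patterns back to characters.

```python
# Hypothetical 3x5 pixel "font": '#' is ink, '.' is blank paper.
GLYPHS = {
    "H": ("#.#", "#.#", "###", "#.#", "#.#"),
    "I": ("###", ".#.", ".#.", ".#.", "###"),
    "T": ("###", ".#.", ".#.", ".#.", ".#."),
}


def render(text):
    """Stand-in for a scanner: turn a string into rows of pixels."""
    # Glyphs are joined with one blank pixel column between them.
    return [".".join(GLYPHS[ch][row] for ch in text) for row in range(5)]


def toy_ocr(rows):
    """Recover characters by exact template matching on each glyph cell."""
    lookup = {glyph: ch for ch, glyph in GLYPHS.items()}
    chars = []
    for x in range(0, len(rows[0]), 4):  # 3 glyph columns + 1 separator
        cell = tuple(row[x:x + 3] for row in rows)
        chars.append(lookup.get(cell, "?"))  # '?' when nothing matches
    return "".join(chars)


scan = render("HIT")

# The picture holds no letters, only pixels, so a text search fails.
searchable_before = any("HIT" in row for row in scan)

# After "OCR" we have real characters again, and those can be searched.
recovered = toy_ocr(scan)
```

Real OCR has to cope with smudges, skew, and unfamiliar typefaces, which is why its output is a best guess rather than a guarantee.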

Was that clear?


Tidbits, 7 September 2009

Sep 07 2009 Published under Tidbits

Happy Labor Day, US readers. Time to clean out the "toblog" tag again:

Like many, I am watching the global economy with equal shares fascination and horror. That pursuit led me to this article, which I read through wholly without my librarian goggles on, until I was happily surprised by the kicker at the end for dataphiles:

The last ten years have seen a quiet revolution in the practice of economics. For years theorists held the intellectual high ground.… The typical empirical analysis in economics utilized a few dozen, or at most a few hundred, observations transcribed by hand…

But the IT revolution has altered the lay of the intellectual land… The data sets used in empirical economics today are enormous, with observations running into the millions… But now it is on the empirical side where the capacity to do high-quality research is expanding most dramatically, be the topic beer sales or asset pricing. And, revealingly, it is now empirically oriented graduate students who are the hot property when top doctoral programs seek to hire new faculty.

Not surprisingly, the best students have responded. The top young economists are, increasingly, empirically oriented. They are concerned not with theoretical flights of fancy but with the facts on the ground. To the extent that their work is rooted concretely in observation of the real world, it is less likely to sway with the latest fad and fashion. Or so one hopes.

The ability to acquire and manipulate large datasets is changing the entire discipline of economics, is how I read that. That's quite a strong statement.

I have more, but they need to wait until I finish the megaposts on library classification. I promise I'm working on them!


Welcome AL Direct readers!

Sep 03 2009 Published under Metablogging

I found out from a few different sources (thanks, all!) that my post about back-of-book indexes made it into American Libraries Direct yesterday.

Welcome to any and all new readers! I hope you stick around. I'm going to tackle classification next…


Migration versus emulation

Sep 02 2009 Published under Praxis

Just a quickie post today—

In answer to my post about intertwingularity, commenter Andy Arenson suggested that the way to rescue an Excel spreadsheet whose functions or other behaviors depended on a particular version of Excel was to keep that specific version of Excel runnable indefinitely.

This is called "emulation," and it assuredly has its place in the digital-preservation pantheon. Some digital cultural artifacts are practically all behavior (games, for instance), and just hanging onto the source code honestly doesn't do very much good. The artifact is what happens when that code is run, which means preserving it means keeping that code runnable, which in turn means preserving its runtime environment as best we can.

No mere bagatelle, this. If you turn up your nose at games (which you really, really shouldn't), consider the humble HyperCard stack from the 1990s. A good many enterprising artists and designers built rather remarkable things on it, as well as over other bits of the early Macintosh systems environment, and all those things are right this minute in danger of disappearing forever because we can't emulate that environment sufficiently well to rescue them.

For most data, though, I honestly prefer a "migration" strategy, in which format obsolescence is fought by modifying files to keep them usable in modern hardware and software environments. Hardcore emulationists disagree with me; I've seen articles boasting that any environment in the history of computing is trivial to emulate, so why even bother with migration? Frankly, I don't believe a word of it. If it were that trivial, it would have been done already. It hasn't.

I prefer migration because emulation feels like putting the data in a museum: look all you want, but don't touch. Data should be touchable, rearrangeable, mashup-able; a good migration will keep them so. Also, in general migration is much less of a reach for memory organizations than emulation. Take me, for instance. I'm a tolerably talented data migrator. I can't do anything with emulation.

Migration itself is not always trivial and can be lossy. My friend Tim Donohue developed (and won a conference prize with) a DSpace hack that sends Microsoft Office files through conversion software running on the DSpace server, saving ODF versions of the files to DSpace along with the Office versions. Worked like a charm, as far as it went. What was the problem? FONTS. Because the server had a minimal font complement at best, the ODF files came out looking unusably horrible.
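To make "lossy" concrete, here is a much smaller migration than the Office-to-ODF one above: moving text from one character encoding to another. This is only a sketch under my own assumptions. The `migrate_text` function and the sample string are hypothetical, and real format migration involves far more than encodings. It does show the habit worth forming, though: after migrating, check whether you can still get back what you started with.

```python
def migrate_text(data, source_encoding, target_encoding="utf-8"):
    """Re-encode bytes and report whether any characters were lost."""
    text = data.decode(source_encoding)
    # errors="replace" substitutes '?' for anything the target cannot hold.
    migrated = text.encode(target_encoding, errors="replace")
    lossy = migrated.decode(target_encoding) != text
    return migrated, lossy


# A "legacy" file in Windows-1252, containing non-ASCII characters.
legacy = "café, 10 €".encode("cp1252")

# Migrating to UTF-8 keeps every character.
utf8_bytes, utf8_lossy = migrate_text(legacy, "cp1252", "utf-8")

# Migrating to plain ASCII silently mangles the accent and the euro sign.
ascii_bytes, ascii_lossy = migrate_text(legacy, "cp1252", "ascii")
```

The ASCII result still opens without complaint, which is exactly why lossy migrations are dangerous: nothing fails, the content is just quietly worse.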

Migration is sometimes impossible, if the origin format is proprietary, opaque, or otherwise not reverse-engineerable. Unfortunately, emulation has limited if any success in this situation as well; if the file format is obfuscated, so is the software environment, generally!

Of course, the gold standard is a research workflow that respects data enough to put thought and care into describing it and using future-friendly formats right from the beginning. We don't live in that world, and we may never live in that world… so the migration-versus-emulation wars are only beginning.


The problem of "expert location"

Sep 01 2009 Published under Praxis

A common problem adduced in e-research (not just e-research, but it does come up quite a bit here) is expertise location, both local and global.

You need a statistician. Or (ahem) a metadata or digital-preservation expert. Or a researcher in an allied area. Or a researcher in a completely different area. Or a copyright expert (you poor thing). Very possibly the person you want works right down the hall, or in the building next door, or in the library, or somewhere on campus. But how on earth do you know?

You could call around to the offices or departments most likely to contain the expertise you're after. (Calling University Legal with a copyright question is a no-brainer, for example.) What if you don't know if the expertise exists locally? Or which office or department it's in? More to the point, what if the usual point of contact at the office or department doesn't have an encyclopedic knowledge of the office's or department's expertise? (Consider a brand-new departmental secretary. Or a library reference desk—how much does J. Random Librarian know about who in the library can answer metadata questions? Or will the caller be referred to IT?)

Looking at this problem from another angle turns it into "How do we collect information about what our staff knows about?" Well, there are sources. Curricula vitae. Institutional repositories. Publication lists. Statements of research interest. Titles of taught courses. Problem is, these bits of information sprawl all over an institution's web space, when they exist at all—and when they exist, they're hardly ever up-to-date.

The annual review process ought logically to update these information stores. I had a frank talk with a faculty member a month or so ago in which he admitted that faculty in his department lavished much effort on annual review documents that then… vanished into thin air, or something like that. Nothing made it onto the department's web space, much less the institutional repository. Nothing even made it into the department's newsletter. The faculty member deplored this situation, but felt powerless to change it.

Well, some libraries are taking up this gauntlet. (Bias disclosure: mine, as co-developer of the BibApp, is one of them.) No, publication lists aren't the be-all and end-all of expert location, but they're a whole lot better than nothing, so if we can centralize them and make them easier to update, why not? And if updating can be streamlined through author-search RSS feeds from the library, great!
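As a rough illustration of why a centralized publication list helps, consider the sketch below. It is not how BibApp or any other real system works; the records, the `build_index` and `find_experts` functions, and the naive split-the-title-into-words matching are all hypothetical. Once the publications sit in one place, though, "who here knows about X?" becomes a lookup instead of a round of phone calls.

```python
from collections import defaultdict

# Hypothetical centralized publication list.
publications = [
    {"author": "A. Rivera", "title": "Bayesian statistics for field ecology"},
    {"author": "B. Chen", "title": "Metadata standards for research data"},
    {"author": "A. Rivera", "title": "Statistics of small samples"},
]


def build_index(pubs):
    """Map each lowercase title word to the set of authors who used it."""
    index = defaultdict(set)
    for pub in pubs:
        for word in pub["title"].lower().split():
            index[word].add(pub["author"])
    return index


def find_experts(index, term):
    """Return a sorted list of authors whose titles mention the term."""
    return sorted(index.get(term.lower(), set()))


index = build_index(publications)
statisticians = find_experts(index, "statistics")
metadata_people = find_experts(index, "Metadata")
copyright_people = find_experts(index, "copyright")
```

The empty result for "copyright" is informative too: either the expertise isn't here, or it exists and has simply never made it into the database, which is the updating problem all over again.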

Cornell's VIVO was close to the first publication database in the United States, and is still the liveliest. This isn't just a big-school frill, however; Appalachian State University's library maintains one. It's worth noting, however, that as usual the United States is behind the curve; because of government-funding allocation rules, nearly all British universities have such a beast and feed it regularly.

Publication databases as expert-location tools. Coming soon to an institution near you?

