Archive for: October, 2009

L'esprit d'escalier

Oct 29 2009 Published under Praxis

If you're not reading comments here, you're missing out. For reasons I don't entirely understand, some of the best in the business are seeing fit to comment here. They have more to teach than I do!

Chris Rusbridge (of, among other things, this thought-provoking meditation on digital preservation) has been spotted here, and whenever he pops up he makes me think about things. This time, I was thinking about disciplinary expertise, and how I need to make a better case that less of it is necessary for data curation than generally admitted.

I hope we can at least admit that data curators don't have to be researchers themselves. Do researchers have to be involved in the curation of their own data? Absolutely! Data curation starts at the beginning of the study-design process, and continues all the way through and past publication. But that doesn't mean that researchers have to do everything. The exact division of labor is still being sorted out; that's partly what this blog is about. That the labor must and will be divided appears to be beyond dispute.

The corollary to this is that a data curator will almost always know less about the data, viewed from certain axes, than the researcher does. She may well know more about it viewed from some other axes—file format details, metadata crosswalking, whatever. Some things, though, she won't know and presumably won't have to.

So what does she have to know about the research and the discipline in order to be a responsible data steward? And does she have to walk into the process with that knowledge pre-existing, or can she learn it as she works on the research project? How much of what she needs to know will transfer from other projects she's worked on?

Cards on the table: in the absence of much evidence either way, I think that someone with the intelligence, disciplinary background, and intellectual curiosity of a good subject-specialist librarian can learn enough "on the job" to hit the 80/20 point pretty easily, and 80/20 is more than good enough for a successful campus data-curation program in my book. For the remaining 20% of edge cases, a program can hire specially.

I'll use a True Story about myself as an anecdote. Feel free to quarrel with me (civilly, please) in the comments.

Some years ago I did a small contract job for the ACLS E-book project. They were working on rekeying and marking up an art-history book with extended segments of polytonic Greek text. Their keying vendor took one look and said "no way do we key polytonic Greek." So ACLS told them to key the rest of it and leave placeholders for the Greek. They came to me asking whether I could key the Greek in proper Unicode without snarling up the markup.

I have never studied Greek. I do not speak Greek. I do not write Greek. I do not read Greek, except in the sense that I recognize the letters and can laboriously sound them out. Don't ask me what in the world the accents and squiggly bits in polytonic Greek mean; I haven't the slightest clue.

Not snarling up markup? That I can manage. After an hour or so of research, I found fonts and tools that could enable me to do the keying job correctly and with reasonable efficiency. ACLS and I agreed on a price, and off I went. I didn't know what the squiggles meant, but I could reproduce them, and that was plenty good enough.

When it came time to proof my work, I didn't rely just on my own eyes; that would have been stupid. I called in my classics-major husband. He found typos and the odd homeoteleuton, which I duly fixed up. I sent the result back to ACLS, and they were happy enough to pay me, so there that is.

And there we have it: a partnership between a tech geek and a reasonably well-trained domain specialist (kindly note that my husband was an undergraduate classics major) took care of a data job. I think this can happen more often in more fields.

The chief barrier is the belief that it can't.

2 responses so far

Can we just give the problem to the libraries?

Oct 28 2009 Published under Tactics

I pointed out Mike Lesk's slideshow in my last tidbits post, finding it a good critical précis of the data problem. It's pleasantly aware of human problems, human problems many treatments of cyberinfrastructure (including, unfortunately, this otherwise useful call to action from Educause) wholly ignore.

So wince and flinch at the design (black Arial on white? really? in 2009?), but read the slideshow anyway.

I do want to pick apart the slide from which I took the title of this post. I reproduce the said slide's text in full:

Can we just give the problem to the libraries?

As a professor in a library school, I wish I could say that libraries were the obvious organization to take care of data. They understand keeping things for a long time and arranging to find them later. It would be a sensible new activity to balance a decrease in foot traffic into book collections. But...

  • They have not been ambitious in this area; libraries feel under budget pressure and don't want new tasks.
  • They lack the subject area knowledge to deal with complex data sets in scientific areas.
  • They often lack the technical skills for advanced data handling.

I have no quarrel whatever with Lesk's first point. Libraries have absolutely been timid about this, and they still are—not without reason, either! This, to me, is the buck-stopper, the Berlin Wall, the concrete bollards. If library administrators shy away from this, or give it lip-service only, Lesk is right and there's nothing to be done. It won't matter how many librarians are ready and willing to do this work, if they're not allowed to or not given sufficient resources and authority to.

How likely is this outcome? In my estimation, more likely than not. My estimation is admittedly colored by this being very early days yet, but as I've remarked before, the longer any interested group dithers, the more likely it is that the action will be elsewhere. The more the action moves away from libraries, the more likely library administrators are to breathe a quiet sigh of relief and turn away from the problem altogether.

So what is a librarian who wants this work to do? Well, one answer is to keep an eye on discipline-specific projects, those that are larger than any single institution, the up-and-coming ICPSRs and Sloan Digital Sky Surveys. For those interested in data curation inside an institution, I think the answer may well be to learn enough to insinuate oneself onto research teams directly through their in-house IT arms. I may revisit this answer later; in-house IT is starting to become just cost-ineffective enough that some recentralization may happen. In that case, the would-be data curator has more options. Either way, though—the wise data curator does not attach himself limpetlike to the library. The action may well be elsewhere.

What is a researcher or funding agency or think-tank that wants libraries to take on this work to do? Researchers need to ask. Nothing gets library priority so fast as a well-articulated request from faculty; that goes double in disciplines where physical library spaces are waning in importance. Agencies and think-tanks: I'd recommend being an awful lot clearer about what the services provided look like and how they need to be staffed. Laundry-lists of skills are useless without an estimate of FTE and budget; such an estimate is noticeably lacking in every single discussion of this problem I've ever read.

I half-agree, half-disagree with Lesk's second point. There's a lot of disciplinary knowledge in academic librarianship. We don't select books blindly! We do it by taking heed of what our local researchers are doing. Many selectors and liaisons assigned to particular disciplines have degrees, sometimes advanced degrees, in that discipline. In the social sciences, by the way, data librarians with appropriate disciplinary knowledge already exist.

The problem isn't the non-existence of disciplinary knowledge; it's the uneven spread of it. For any given discipline at a research university, I'd guess it's a better-than-even bet that the library has a librarian somewhere with appropriate disciplinary expertise—but it's not a certainty.

Of course, there's also a question of how much disciplinary expertise is actually necessary for this work. Diane Hillmann remarked to me at ALA this summer that "[researchers] all think they're special snowflakes," but in her experience the basic sustainability questions don't differ all that much from dataset to dataset. That's what I think, too, with the added wrinkle that disciplinary specialists may actually be too close to their data to have a good read on how others will want to use and query it. An outsider perspective may well be useful!

(The real problem is one of first impressions and secret handshakes, as my SciBling Christina adroitly points out in the context of reference interviews.)

I could very nearly recycle the answers I just gave to Dr. Lesk's second point for his third. In aggregate, research libraries have quite a lot of technology expertise. How much any given library has isn't predictable, and may well not be sufficient.

If we cross the answer to the second point with the answer to the third, we approach the real conundrum: sufficient disciplinary expertise and sufficient technical expertise tend not to coexist within the same librarian. Take me, for example: if it's textual or linguistic data, I'm your librarian, because that's my educational background! I can apply common sense and well-honed data-management expertise to numeric or instrument data, but I can't apply disciplinary knowledge because I don't have it. Selectors and liaisons, conversely, likely understand quite a lot about local research in the disciplines they serve, but they mostly don't sling Python and XSLT, nor do they tend to have the digital-preservation knowhow that I do.

John Saylor of Cornell gave what I believe to be the appropriate answer to this problem in his talk at ALA Annual: a technical team dedicated to data needs to work with librarians who have disciplinary expertise in order to solve problems. The disciplinary coverage achievable with this staffing model won't reach 100%, but it'll get as close as seems feasible. Nota bene: without broad participation by disciplinary specialists across the library, a data-curation service suffers and may well fail!

Lesk's objections are serious, pertinent, and pointed. They are not, I believe, unanswerable, but answering them will take considerable vision and will on the part of research-library administrators. Time will tell.

2 responses so far

Classification and a bit of subject analysis

Oct 26 2009 Published under Uncategorized

It's been a while since I did anything on my series about library ways of knowing. If you'd like to refresh your memory:

Today I'll finish my discussion of classification, and distinguish it from subject analysis, since that distinction often seems to confuse, especially in our digital age.

So if we'll recall, the goal we set for ourselves was to collocate physical books on shelves in such fashion that their arrangement would be useful to information-seekers. With most non-fiction, that means collocation by subject, by what the books are about.

(There are lengthy philosophical discussions of "aboutness" in the information science literature. I recommend avoiding them with all your strength. They make my eyes bleed.)

To make this work, we have to map knowledge-space onto physical space: divide up human knowledge into convenient slots to assign books to. This is, you might say, a tall order: an ontology of infinite domain in which each item can fit in only one place.

In the States, most libraries use one of two such maps: the Dewey Decimal Classification or the Library of Congress Classification. About the kindest thing one can say for Dewey Decimal is that it was a product of its peculiar time; for today's purposes, it is heavily overnumbered in religion, for example, and undernumbered in science. Perhaps worse, its sense of the world is not exactly immediately intuitive to the modern eye: why the long separation of geography from the so-called "social sciences," of which psychology is apparently not one?

This is one danger of any would-be universal classification. Our sense of the world and its knowledge changes over time, sometimes quite a lot and quite suddenly. If our ontology doesn't keep up, it serves its purposes less and less well. How easy is it, really, to find the right shelf in a library of any size organized by Dewey Decimal? Considerations such as these no doubt informed the shift of one library (and later others) to the BISAC codes typically found in large bookstores.

Another danger of the universal classification is that its specificity is of necessity somewhat limited. Many medical libraries, for example, ditch Library of Congress Classification because it just doesn't drill down far enough into medical minutiae for their needs. The NLM Classification fills the gap.

With physical books, we cannot escape the constraint that each book must go in one and only one place on the shelf. Once we're away from the physical item, that constraint disappears. The card catalogue was the first desperately clever escape from the tyranny of the physical item: in a card catalogue, the same book could be "shelved" by author, title, and one or more (usually three to five, to avoid overproliferation of cards) subjects assigned to it by the cataloguer.

This meant the addition of a subject-heading system to the classification vocabulary. You can't just add more classification numbers to the physical item; you then imply that it goes in more than one place! This is the difference between Library of Congress Classification and Library of Congress Subject Headings. Under most circumstances, the LCC number assigned to a book will correspond closely in meaning to the first LCSH assigned in the book's catalogue record. They are still distinct systems, however! Don't confuse them. Librarians chuckle behind their hands.
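For the programmers in the audience, the distinction can be sketched as a data structure. What follows is a minimal Python illustration, not anything a real library system does: the `Book` record, the call numbers, and the headings are all made up, and sorting call numbers as plain strings is a simplification of real LCC filing rules.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Book:
    title: str
    author: str
    # Classification: exactly one call number, so exactly one place on the shelf.
    call_number: str
    # Subject analysis: any number of headings, so any number of catalogue entries.
    subject_headings: list = field(default_factory=list)


def shelf_order(books):
    """Each book appears once, ordered by its single call number (plain string sort)."""
    return [b.title for b in sorted(books, key=lambda b: b.call_number)]


def subject_index(books):
    """Each book is filed under every subject heading assigned to it."""
    index = defaultdict(list)
    for book in books:
        for heading in book.subject_headings:
            index[heading].append(book.title)
    return dict(index)


# Hypothetical records, invented purely for illustration.
books = [
    Book("Baskets of the Aegean", "H. Troia", "B200",
         ["Basket making", "Aegean Sea Region"]),
    Book("Weaving Data", "U. Acqua", "A100",
         ["Basket making", "Data curation"]),
]

print(shelf_order(books))
print(subject_index(books))
```

The shelf list holds each title exactly once, while the subject index holds a title once per heading. That is the whole difference between the two systems in miniature.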

Of course, digital items don't have to live in just one space. Classification is therefore slowly giving way to subject analysis and similar ways of relating items to each other as digital libraries develop.

And that, in a remarkably simplified nutshell, is how books are arranged on shelves in libraries. It doesn't happen by magic!

No responses yet

Tidbits, 23 October 2009

Oct 23 2009 Published under Tidbits

My tag overfloweth…

  • A challenge to libraries from an information science professor: "I wish I could say that libraries were the obvious organization to take care of data… But… they have not been ambitious, they lack the subject area knowledge, they often lack the technical skills." What say ye, librarian Trogoolies?
  • Cross-disciplinary use of data shines in this account of the decline of the Maya. "Space technology is revolutionizing archeology." Who would have guessed it?
  • On the tools front, take a look at the Tranche Project, aimed at securely sharing datasets among researchers.
  • On the interesting collections front, Canadensys is trying to collect biodiversity information from various researcher networks. Their technical infrastructure is very much "use what you have; build only what you must."
  • Why build yet another silo for data? Exploring curation micro-services is a great introduction to the simple, UNIX-y tools coming out of the California Digital Library.
  • And because it's Friday, a lovely lyrical reflection from John Mark Ockerbloom on why preservation matters. Sometimes it's not all about the bottom line.

Have a good weekend!

No responses yet

Open Access Week: Profile of Sarah Shreeves

Oct 20 2009 Published under Miscellanea

I have intentionally steered Book of Trogool away from open access. I still believe in it; I still work for it. Toward the waning days of Caveat Lector, however, it became clear that I was generating more heat than light on the subject, so I made a conscious decision not to repeat that mistake here.

This is, however, Open Access Week. I would feel rather churlish about ignoring that, especially since I was speaking yesterday for the occasion. What I'll do, then, is try to set a radical example I wish others in the open-access movement would follow: I'm going to celebrate a librarian.

Her name is Sarah Shreeves, and she works for the University of Illinois, where she runs the IDEALS institutional repository and has just accepted the post of Scholarly Commons coordinator.

Sarah has built IDEALS the hard but honest way. No gimmicks, no big behind-the-scenes uploads, no lofty unmanageable promises, not even ETDs until just recently. Sarah forges relationships. Sarah puts her energy behind useful software development. Sarah gives good service. That's what Sarah does. IDEALS isn't the largest IR out there, but in all honesty, when I pull something to read for myself out of an IR, more often than not IDEALS is its source. That says something about the quality of the material therein.

In fun, I often call myself Sarah's evil twin, as our thoughts about open access and institutional repositories often dovetail. Sarah's genius is that quietly but inexorably, she makes those thoughts not only known, but the coin of the realm, changing minds so deftly that they hardly know they've been changed.

I have seen Sarah present, and been privileged to sit on a conference panel with her. Her style is unassuming, unthreatening—but don't let that fool you: what she says will challenge you, and she never accepts received wisdom at face value.

Sarah takes bold steps to establish open access and related issues firmly in the canons of librarianship. She co-edited the institutional-repository issue of Library Trends that came out earlier this year, and she was the one to request an article from me, the article that turned out to be "Innkeeper at the Roach Motel." She is also ultimately behind Illinois's support of BibApp.

This is a poor tribute at best; when the history of open access is written as it should be written, Sarah will occupy many pages therein. For this Open Access Week, I salute Sarah and her many accomplishments.

7 responses so far

Graft or hybridize?

Oct 17 2009 Published under Tactics

I've spent the whole of my short career in academic libraries thus far on the new-service frontier. In so doing, I've looked around and learned a bit about how academic libraries, research libraries in particular, tend to manage new services. With apologies to all the botanists I am about to offend by massacring their specialty, here is my metaphor for the two main courses of action I see: grafting the new service on like an apple branch to a crab-tree, or hybridizing the new service with existing services, thus changing the library from the ground up.

Each approach works in some situations, it seems to me. Each approach may fail in others. The question relevant to Trogoolies is—how should a data-curation service work? Graft, or hybridize?

Computer systems administration and tech support, for example, are grafts in most libraries I'm aware of. They have their own staffs who do their own thing and don't interact much with other library staff or services except when something breaks. (It could be argued that the introduction of computers into libraries was a hybridization process; this is true, but it doesn't mean that the library organization necessarily hybridized, and in fact I don't think it did.) For the most part, this seems to work fine. Cataloguers and instruction librarians don't need to learn how to configure Apache or Tomcat!

MARC cataloguing, oddly, is a grafted service of long standing. I don't wonder some cataloguers feel helpless before the onslaught of outsourcing, stub records, collaborative cataloguing, et cetera. Because cataloguers are grafted onto the library, it's relatively easy to think about their value in isolation from the rest of the library and sort out how to achieve that value without them. There is a roaring segment of the library literature that believes this is both desirable and inevitable; another segment, of course, is pushing back with all its strength. The proof will be in the budgeting.

An example of the graft approach falling down, it seems to me, is the intersection of systems librarianship with public service. Website design and management. Institutional repositories. Even (some) digital libraries. The graft charged with these matters needs either to forge its own public-service links with the patron base, which is a Sisyphean task for a grafted service's typical level of staffing, or it has to go through the tree-trunk to leverage that trunk's existing contacts. Unfortunately, the trunk may or may not decide to be helpful. Cues from library administration matter considerably, I believe; a tree-trunk that is not told to help a graft will simply starve it. Starvation happens a lot, especially when library administration isn't itself clear on the value of the grafted service.

(The other problem with grafted services, of course, is that it's terrifyingly easy to lop a grafted branch off the budget tree. I'm unhappily witnessing that threat to a number of digital libraries and IRs now, as it happens.)

Information-literacy instruction strikes me as a service in the process of hybridization: fully hybridized at some libraries, partially at others. I do know of some libraries that try to treat it as a graft, but that road seems to lead to too much work for too few people, and eventual hybridization to handle the load. Hybridized info-lit programs seem to work reasonably well, though admittedly there are longstanding questions about the general level of pedagogical skill in librarianship.

Collection development seems to be going in the other direction. This was unquestionably a core service not so long ago in librarianship, and many libraries still consider it one. Events have conspired to push it aside, however, from the Big Deal to approval plans to Google Books. What I'm seeing now (cf. 2CUL) is a willingness to confine this labor to an ever-smaller group of people per library, and a growing belief that holding deep disciplinary and existing-collection knowledge locally isn't the crucial asset for collection development that it once was.

(The question of "local" collection development is only just starting to arise. It's an interesting one for Trogoolies! But this post will be long enough as it is, so…)

So let us consider data curation for a moment. Is it so specialized and grant-funding-driven that a grafted service is appropriate? Or should libraries undertake the fearsome organizational work necessary to hybridize it? (Make no mistake, the organizational work is indeed fearsome, not to be lightly undertaken. Instruction did not travel an easy road to its current hybridized state, and this absolutely brilliant preprint (PDF) discusses the horrendous difficulties one research library bravely sought to conquer when it tried to hybridize scholarly communication.)

My cards on the table: I believe that because of the disciplinary knowledge and necessary public-service responsibilities inextricably entwined with data curation, a data-curation service grafted onto the library may succeed in the short run (or perhaps spottily, in one or a few disciplines), but will fail in the long run. To thrive, data curation will have to become part of the library's core, touching—changing—reference librarians, liaisons and "embedded" librarians, selectors, instructors, systems librarians, and others.

We know from multiple research studies on the subject that researchers believe that the sine qua non skill for data curation is disciplinary knowledge. I have had my doubts about that; I still have my doubts about that. But all by itself, the perception is important, because researchers are the gatekeepers for their data and they won't let people they perceive to be disciplinary ignoramuses anywhere near. In practical terms, then, Achaea University's library can dub Ulysses Acqua a data curator, but he'd better not go anywhere near Dr. Helen Troia or her data without either an extensive basketology background (and librarians are often generalists, but nobody knows everything) or the knowledgeable basketology liaison librarian Menelaus Fox to back him up.

Trust me on this one: Menelaus Fox isn't going to move so much as the tip of his little finger for Ulysses Acqua or Dr. Troia's data unless he's told he'd better. There are exceptions to this rule, of course, and these exceptions are how pioneering data-curation grafts are bootstrapping themselves. If you ever have a chance to hear Marianne Stowell Bracke of Purdue speak, do yourself a favor and go; she is a sterling example of the exception to the rule, and she's what I think libraries wanting to make campus-wide data curation work will need to aspire to in most (if not all) of their discipline-related staff.

In short, libraries considering data-curation programs will almost certainly start them as grafted services; I can't imagine immediate or anticipatory hybridization even being considered. I myself would be very leery of working for any service whose library administration doesn't have hybridization ambitions for it, however. Such services seem liable to wind up in institutional-repository limbo, which helps no one.

When I have time, which I emphatically don't right now, I'm going to reread this book. I remember it having very smart things to say about what I'm terming hybridization and (if memory doesn't fail) the book refers to as "mainstreaming." I encourage librarians to read it with me!

No responses yet

Quick thought: rejecting data or rejecting people?

Oct 16 2009 Published under Tactics

I'm still buried in translating a presentation into Spanish for Monday and finishing another in English for Wednesday, but here's a small thought to tide folks over, a thought that came to me shortly before my presentation at Access.

At the data-curation workshops I've been to, it has been axiomatic that "we can't afford to keep it all." Some fairly sophisticated judgment rubrics have been worked up, often based on the same kinds of judgment calls that special-collections librarians and archivists make when presented with collection opportunities. Is this dataset unique, or could it be recreated? Is it well-described? Is it in good shape? What is its importance to its field? Et cetera.

There's a problem with this mode of decision-making. It's a human problem. It's a problem that is endemic in the institutional-repository context, which is where I became acquainted with it.

The problem is perhaps best illustrated with a parable; I'll borrow Achaea University from Caveat Lector. Dr. Helen Troia comes to data archivist Ulysses Acqua with a pile of helter-skelter basketology data. Ulysses scrutinizes the dataset (with the help of basketology liaison Menelaus Fox), assesses its value honestly, and decides it just doesn't make the cut. He tells Dr. Troia so, stating his reasons in a professionally courteous fashion.

Will Dr. Troia come back to Ulysses five years later, when she's created the dataset that will revolutionize basketology forever? Not terribly likely, I'd say.

There are people behind every dataset, people who care deeply about their work. Rejecting their data is tantamount to rejecting their work, rejecting them as researchers. While such rejection may still be necessary, it should not be done lightly—it is an act with far-reaching political repercussions.

What, for example, will Dr. Troia tell her departmental colleague Dr. Andromache Memnon about Ulysses and the data service? What happens to the Basketology department's data should Dr. Troia become department chair?

Uncomfortable questions, but ones to take into account when designing and publicizing criteria for what data-curation services accept.

9 responses so far

Comment snafu resolved

Oct 13 2009 Published under Metablogging

If you've been having trouble commenting, you're not alone—the comment form quit working for me a couple days ago.

I wrote in to Erin, and from where I'm sitting, the problem has been fixed. If you're not getting comment-form love, email me at dorothea.salo at gmail and I'll see what I can do.

Speaking of comments: I am despotic about them, I'm afraid. If I suspect you're a spammer, or if I'm sure you're a timewaster, your comment will silently disappear. I don't expect to have to pull the trigger often (even spam levels around here have been muted), but a warning is only fair.

No responses yet


Oct 10 2009 Published under Metablogging

I will be speaking at UKSG's conference next April. They haven't given me a topic… but they want a talk title by the end of this month.

I have to write a paper alongside the talk, and I hate writing papers with every last fiber of my being, so if I have to do it, I want to make it count for something.

Anybody got any suggestions? What should I write and talk about, that the UKSG audience needs to hear? If you think you're among the intended audience, what would you want to hear or read from me?

2 responses so far

Set your house in order

Oct 07 2009 Published under Tactics

Roy Tennant sent me an email about my Access presentation in which he asked what libraries should do about the laundry-list of data-curation challenges I presented. (If you're curious, you can go view the presentation yourself, courtesy of the wonderful A/V folk at Access. The less-than-an-hour-long way to assimilate the same information is to look over slides plus talk notes on SlideShare.)

That's an eminently fair criticism. I've been thinking about it since receiving the email. I think the answer for libraries is to set their own digital houses in order first thing. After all, how can we justify the claim that we can help researchers manage digital data if our own data are a jumbled mess?

  • Do you have digital preservation policies and procedures? Are they clear about what the library can and cannot do? (If all you can do is bit preservation, fine, but say so.)
  • Do you have access to appropriate technological infrastructure for digital preservation? (I say "access to" advisedly. I don't actually care whether the big backed-up disk lives in the library or with campus IT, as long as it's appropriately provisioned.)
  • Do you include local born-digital materials in your collection-development policy? Are your selectors or liaisons actively collecting? If not, start the necessary conversations.
  • Is there unnecessary proliferation of digital-library software packages or hosted services in your library? Fix it. Migrate off the less-capable platforms (which may actually be all of them!) onto something that will scale and be flexible. A platform insufficiently flexible to handle all your digital-library needs will be pitifully inadequate for data.
  • Do you have an institutional repository? Are you getting value out of it? Really? If not, have the courage to migrate the content and shut it down, re-assigning its manager to something more useful—data services, perchance.
  • Do you have physical media such as diskettes and CDs lying around, with material that needs preservation? Fix it. Get the data onto your chosen platform. Chances are these data are fairly representative of the challenges you'll run into doing data curation, so why not learn on your own materials?

A library that has accomplished all the above goals is in an extremely good position to start a serious data-curation program.

No responses yet
