Archive for: September, 2009

Talkin' 'bout my institution: A clarification

Sep 29 2009 Published under Metablogging

A comment Chris Rusbridge left on a previous post leads me to clarify the extent to which the subject matter of this blog draws on my own position in the institution where I work, and that institution's take on matters data-curational.

In brief: It doesn't. I don't talk about my place of work here, and I have no plans to start doing so.

I have no data-curation or other cyberinfrastructure responsibilities at my workplace save those that happen to touch on my position as institutional-repository manager. The day I acquire such responsibilities, which is not wholly impossible but by no means a certainty, will be… an interesting day for this blog, to say the least.

I do not speak for my institution or its library system here, just as I do not speak of them (except in the heavily meta fashion of this particular post, and I hope to write posts like this one as seldom as I can conveniently manage).

Why do I blog about cyberinfrastructure if it isn't my job? Because I'm interested. I care deeply about digital information, and have for quite a while now, over a decade. I like to think out loud about it, and what's happening with digital research data simply fascinates me. Also, because I believe that raising the profile of librarianship in the research community—assuming I can actually pull that off—is a useful occupation. Where do I pick up what I talk about here? Through reading, thinking, watching events, and talking to a wide variety of people online and off-, much as you'd expect. Not, alas, through hands-on experience at this juncture.

So apply all appropriate doses of sodium chloride to whatever you read here. That I say something seems like a good idea to me doesn't mean I'm doing it. It bears no relationship whatever to what my institution is or isn't doing and thinking.

Honestly, I don't even think of myself as an expert in this area. Several commenters have already caught me out on inaccuracies and other sloppy thinking, and I'm grateful to them for it. I'm trying to figure all this out right alongside everybody else. I believe there's value in taking that free-ranging thinking process public. (Actually, I know already that there is value in it for me! I've learned a lot from Book of Trogool commenters.)

Just so we're all clear on that. Thanks.

One response so far

The dreaded backfile

Sep 29 2009 Published under Praxis

One of the problems practically every nascent data-curation effort will have to deal with is what serials librarians call the backfile, though the rest of us use the blunter word backlog.

There's a lot of digital data (let's not even think about the analog for now) from old projects hanging around institutions. My institution. Your institution. Any institution. There may be wonderful data in there, but chances are they're in terrible condition: disorganized, poorly described if described at all, on perishable (and very possibly perished) physical media. This pile of mostly-undifferentiated stuff is what all the digital-heat-death-of-the-universe people are on about.

What to do about it? Make no mistake, it takes considerably more human ingenuity and effort to rescue data than to treat it right at the outset. If a small data-curation team just out of the starting gate tries seriously to come to grips with the backlog problem, it will almost certainly swamp itself, to the point that it won't be able to get in on the ground floor of new data-generating projects—which of course only perpetuates the problem.

I hate to say this, but… I believe we'll have to let a lot of those data lie. We can use some of the backlog to learn on; I would be inclined to start with data relating to a revered institutional priority such as theses and dissertations. We can possibly also pick up a few horses in midstream, researcher workflow permitting.

Grant agencies should look seriously at data-rescue projects, in my opinion. Grant funding is lousy for sustainability, but for rescue projects where the main effort is a one-time licking into shape and the sustainability is a given, grant funding makes a lot of sense. There's certainly no lack of data to rescue!

Still, I strongly believe that the principal priority of a new data-curation team should be new data, new workflows, and new research projects. Perpetually playing catch-up is not a good space to be in. Also, faculty aren't nearly as engaged with their old projects as their current ones, so for good word of mouth and campus visibility, working with current projects is the way to go.

Thanks to Chris Rusbridge for making me think about this. The answer I arrived at wasn't the one I expected.

A short reminder: I'm at Access 2009 the rest of this week. Blogging is liable to be nonexistent.

6 responses so far

Good user experience is not optional

Sep 25 2009 Published under Praxis

Sometimes it's worthwhile to let my "toblog" folder marinate a bit. Posts I recently ran across on two different blogs illuminate the same point so well that they deserve their own post here!

Off the Map offers Huffman's Three Principles for Data Sharing, which are really principles for data-collection and -display applications:

  1. Create immediate value for anyone contributing data.
  2. Make contributors' data available back to them with improvements. (emphasis mine)
  3. [Urge users to] share derivative works back with the data-sharing community.

Absolutely. These three principles boil down to "Offer value for effort." We can build the biggest, most bulletproof disk and the shiniest audit trackers and the most knowledgeable data-curation staff the world has ever seen, but if researchers do not perceive value, or if the effort necessary to realize that value is excessive, they will stay away in droves.

We know this already. How we know it… well, that's the subject of the second post, HangingTogether's discussion of the popular paper-sharing service Mendeley. Why is Mendeley popular? According to the post:

  1. Its appeal is intuitive.
  2. It is instant.
  3. The demands it makes are low compared to the benefits it provides.

Do click over to read the whole post, which provides considerably more detail about Mendeley's featureset and why it is attractive. (Lack of bias alert: I am not a Mendeley user. You will pry my Zotero out of my cold dead fingers. Go Patriots!) The paragraph I want to address here is this one:

If it realises the potential many people are now predicting, the library community is bound to ask why a web application based on an entertainment model should have proved so much more attractive than the painstakingly built repositories we have been holding under the noses of our academic authors over the last several years?

Speaking as a librarian who's been running institutional repositories for nearly five years, I'm not asking. I think we know already. Mendeley offers low effort and high perceived value. IRs demand high effort and return negligible perceived value. If you know IRs, go back over the two lists above and think about how IRs rate. For myself, I will say that I try to add value to items in IRs when I can—sometimes it's as simple as name deduping so that an author browse or search works predictably—but I am under no illusions otherwise.
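The name-deduping I mention can be as humble as collapsing variant spellings of an author's name to one canonical form, so that a browse or search returns a single complete list. A minimal sketch of that idea in Python (the records and normalization rules are invented for illustration, not how any particular IR does it):

```python
# Collapse variant author-name strings so an author browse or
# search returns one complete result set instead of several partial ones.
from collections import defaultdict

def normalize(name: str) -> str:
    # Crude normalization: lowercase, strip periods and extra whitespace.
    return " ".join(name.lower().replace(".", "").split())

def dedupe_authors(records):
    """Group records by normalized author name; return name -> records."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[normalize(rec["author"])].append(rec)
    return dict(grouped)

records = [
    {"author": "Smith, J.", "title": "Paper one"},
    {"author": "Smith, J",  "title": "Paper two"},
    {"author": "smith,  j.", "title": "Paper three"},
]
grouped = dedupe_authors(records)  # all three land under one name
```

Real name disambiguation is harder than this, of course (two different J. Smiths!), which is part of why it counts as added value rather than free.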

We cannot, must not repeat this error as we design data-curation systems. (Systems are more than technology, let's not forget; the human-resource aspect of service design matters too.) We cannot. If we do, the next Mendeley will eat our lunch—and then go under, taking all the data stored there with it. (Wonder what wakes me up at night in a cold sweat? Now you know.) Think it can't happen? Say, did you hear the news about GeoCities? And where are the ebooks of yesteryear?

Perhaps the title of this post is better stated: "Bad user experience is not an option."

7 responses so far

Because I feel neglectful

Sep 24 2009 Published under Miscellanea

I know I said I'd be neglecting the place for a bit… but I still feel bad about that!
Here's what I've been working on. I'm afraid this is sort of the Cliff's Notes version, but at least it looks pretty?

Grab a bucket! It's raining data!

If you're coming to Access 2009 next week, you'll see the full version, which should make a bit more sense.

2 responses so far

Cost and service models for data curation

Sep 19 2009 Published under Tactics

In many of the data-curation talks and discussions I've attended, a distinction has been drawn between Big Science and small science, the latter sometimes being lumped with humanities research. I'm not sure this distinction completely holds up in practice—are the quantitative social sciences Big or small? what about medicine?—but there's definitely food for thought there.

Big Science produces big, basically homogeneous data from single research projects, on the order of terabytes in short timeframes. For Big Data, building enough reliable storage is a big deal; it's hard to even look at the rest of the problem until the storage piece is solved. Some in the data-curation space focus unabashedly and exclusively on Big Science—Lee Dirks's well-constructed and lucid talk at Harvard yesterday hinted that he is one of these. Standards for data tend to grow fairly quickly in Big Science environments, both de facto (because there's only one source for the data!) and de jure (as in astronomy, which is a fascinating story I'm not quite competent to tell).

Big Science also has big money. It can't be done at all otherwise. The corollary to big money is big teams of researchers and allies.

Small science is what those of us who work at colleges and universities are more accustomed to. Grants are small if they exist at all; research is generally a solo or single-lab endeavor. Research procedures are often ad-hoc, invented by the researcher like Minerva springing from the head of Jove. Data standards do not exist; as often as not, there isn't a critical mass of people doing similar enough work and willing enough to share data to come together to create a data standard.

It has been asserted that small science, taken as a whole, is likely to create more research data than Big Science. When I tracked this assertion toward its source some time ago, the source turned out to be an otherwise-unsupported statement in the Chronicle of Higher Education (can't link; article behind paywall). So I give you this assertion despite not having any proof for it other than intuition. It is intuitive: Big Science accounts for few researchers owing to its expense; small science is a horde, comparatively. Many small datasets add up startlingly fast, partly because storage for each one is less of an immediate issue, partly because the fundedness or Bigness of a science is not necessarily a good measure of its data requirements. (Any research creating high-def digital video in quantity right now is stuck in just as nasty a storage problem as Big Science.)
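The arithmetic behind that intuition is easy to sketch. All the figures below are invented for illustration—no one has good numbers, which is rather the point:

```python
# Back-of-envelope: one institution's Big Science output vs. its
# horde of small projects. Every figure here is an assumption.
big_projects = 2        # Big Science projects on campus
big_tb_each = 500       # terabytes per Big project

small_projects = 2000   # solo researchers and single labs
small_gb_each = 600     # gigabytes per small project

big_total_tb = big_projects * big_tb_each            # 1000 TB
small_total_tb = small_projects * small_gb_each / 1000  # 1200 TB

# Comparable storage totals -- but the Big data is described once,
# en masse, while each small dataset needs individual attention.
datasets_needing_individual_attention = small_projects
```

Tweak the assumptions however you like; the pattern that survives is that small science rivals Big Science in bulk while dwarfing it in the number of distinct things a curator must touch.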

When I look at business models and processes for data curation, storage is honestly the least interesting aspect of the problem to me. Partly this is privilege talking: where I work, the intricacies of digital storage are Somebody Else's Problem. All I have to do is find stuff to fill it up! Partly it's consciousness that this problem is absolutely being actively worked on—watch Dirks's presentation for examples. I have faith that the storage problem will be decently managed.

Mostly, though, it's that I'm a librarian, not a sysadmin. The problems that interest me about data are the description, discovery, format, interoperability, and human problems. And I can see a serious, scary human problem lurking under the Big-versus-small science question.

I'm going to hold it as axiomatic that on some level, all of the data arising from the research enterprise are equal in importance, at least potentially. We can't know a priori which researcher studying which phenomenon in which institution will produce data that make possible a startling insight. We triply can't know this a priori because of aftermarket (so to speak) data mashups. The original experiment may have been a bust, or the original observation apparently uninteresting, but just combine those data with other data and watch them fly!

It does not seem, though, that under the data regimes emerging, all data will receive equal care. Even within our own institutions, them that has the gold will make the rules, as "cost recovery" becomes the order of the day. Big Science has the gold. Small science doesn't, and neither do the humanities.

I wonder whether cost-recovery institutional cyberinfrastructure will manage to survive, honestly. (I hasten to say I don't know that it will fail, but I have misgivings.) Big Science has a history of funding and managing its own research-related services, even to running its own libraries. Why would data curation be the exception? Arguably it should be because of the long-term, past-grant-expiration sustainability requirement, but I don't think that argument has ever stopped Big Science before. So where are cost-recovery ops going to recoup their costs? Small science can't pay. And how is cost-recovery a viable business model for data that has to survive lean grant times, anyway?

There's a scale problem involved, too. Because Big Science creates lots of basically homogeneous data, once you're past the storage problem, the other problems are fairly efficient to solve. Once you've sorted out how to describe Big Science data, the procedures can be institutionalized, solved en masse over the entire project. Set it and forget it. Human-resource cost per terabyte of data: minimal, even absurdly small.

Small science, by comparison, creates lots of little pieces of highly heterogeneous data. Without standards, each piece will need individual attention if it is to be adequately described and future-proofed. Human-resource cost per terabyte of data: frightening. Certainly, some of these data will be relatively simple to cope with, and I do expect standards and practices to improve generally; it won't always be necessary to explain the idea of metadata to people. Even so—this is high-touch, high-expense work, even when the actual storage requirements are minimal!

Where is the money to come from? I don't know. Until we all interrogate some of the assumptions underlying our business models, however, we won't be able to advance equitable solutions to the data-curation problem.

5 responses so far

Object lesson: when researchers run repositories

Sep 17 2009 Published under Tactics

I commented here earlier, not without frustration, about a pair of researchers who built and abandoned a disciplinary repository. I was particularly annoyed that they seemed to have done this purely for self-aggrandizement, apparently feeling no particular attachment to the resulting repository.

Such as they should not open repositories. Neither they nor any service they offer is trustworthy. I hope that's uncontroversial. Unfortunately, even vastly better intentions than that don't guarantee the sustainability of the result, even in the short term.

The Mana'o anthropology repository, started by Dr. Alex Golub on the traditional server-in-the-basement that is the origin of many a worthwhile project, has been encountering significant technical difficulties. Dr. Golub is no longer able to maintain it, and is looking for some way to hand it off.

In my reasonably well-informed opinion, any such one-person effort will need a rescue at some point. If nothing else, people die unexpectedly! Dr. Golub's mistake isn't that he wound up needing a rescue; it's that he didn't anticipate and plan for a handoff from the beginning.

It isn't just Mana'o, I'm afraid. How many disciplinary and institutional repositories have done succession planning? If yours hasn't, why not? Do it. Now. It is flagrantly irresponsible not to.

Scholarly societies have the best fit with the disciplines, of course, but many who might otherwise accomplish rescues are hamstrung by the need for anything requiring effort to pay for itself or even make money. AAA won't be picking up Mana'o, I confidently predict based on their track record vis-à-vis open access.

Librarianship's continuing error, as I pointed out in the post I linked above, is that we have no infrastructure or plan for accomplishing these rescues, which (I anticipate) will continue to be necessary and may even accelerate in the coming years. Institution-based efforts such as IRs have the technical and human-resource capacity to pick up the slack; what they don't have is policy that allows them to, and coordination to notice work that needs doing and parcel it out appropriately.

As for institutional IT, which might be another natural place to look—they, too, have no policy mandate to address needs originating outside the institution.

How does this relate to data? Well, the problems are the same, really. One-researcher or one-lab IT infrastructures live on a razor's edge; one missed grant may kill them. They hardly ever consider succession planning; worst-case, their IT people (usually wrongly) believe that whatever they're doing is perfectly adequate and will not accept gentle correction.

What this suggests to me, among other things, is that passive data collection is inadequate as a data-repository population model. (Not a surprise, I'm sure; we tried that with IRs and it failed.) Someone needs to go out there and find the good stuff, then open the conversation about how best to keep it.

It also suggests that we need to open a discussion of this issue in cross-institutional fora. CNI, ARL, Educause, JISC, where are you?

7 responses so far

Tidbits, 16 September 2009

Sep 16 2009 Published under Tidbits

The Book of Trogool turns another page...

  • Social scientists and medical researchers, pay attention to this: "Anonymized" data really isn't—and here's why not. If informaticists aren't starting to run similar analyses on their own "anonymized" data, they should be. This is a serious concern.
  • One for the humanists: the rather vaguely-named Scholarly Communication Institute Report from Virginia. The theme was using spatial data in the humanities.
  • From my SciBling Christina: Anybody can code… but should you? Peer review is for more than published papers. Holding your code close to your chest probably means you're writing unnecessarily bad code. Trust me. I write a lot of bad code.
  • The data tell the story. Government data in this case, but imagine what could be done with research data! Imagine!
  • What is the scientific paper? A sensible outsider's view. Money quote for our purposes: "Like it or not, science increasingly depends on data being published in public machine readable formats."
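The re-identification analyses I'm urging on informaticists aren't exotic. The classic first step is just counting: how many records share each combination of supposedly harmless quasi-identifiers? A minimal sketch, with invented rows:

```python
# How "anonymized" is a dataset? Count records per quasi-identifier
# combination (ZIP code, birth year, sex). Any combination held by
# exactly one record is a potential re-identification target.
from collections import Counter

records = [  # invented example rows: no names, yet not anonymous
    {"zip": "53703", "birth_year": 1971, "sex": "F"},
    {"zip": "53703", "birth_year": 1971, "sex": "F"},
    {"zip": "53715", "birth_year": 1958, "sex": "M"},
]

counts = Counter((r["zip"], r["birth_year"], r["sex"]) for r in records)
unique = [combo for combo, n in counts.items() if n == 1]
# One record stands alone; anyone holding a public list (a voter
# roll, say) for that ZIP code can likely put a name to it.
```

This is the intuition behind k-anonymity: if any quasi-identifier combination appears fewer than k times, the "anonymization" leaks.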

Personal note: I may be a little scarce around these parts for the next little while. I have three presentations to give in the next six weeks, none of them the same, none of them finished yet. In fact, two of them are but gleams in the back of my cerebellum. This is eating most of my off-work time at present.

Hope your Hump Day was fruitful.

One response so far


Sep 14 2009 Published under Praxis

When I was but grasshopper-knee tall, my father the anthropologist took me to his university's library to help him locate and photocopy articles in his area of study for his files. He had two or three file cabinets full of such copies. (He may still.)

I have similar file cabinets, two of them: my account and my Zotero library. The account consists merely of links. The Zotero library, on the other hand, includes the actual digital object(s) as often as I can manage it (even at a major research university like MPOW, I cannot always lay eyes on everything I want to read). Zotero is capable of holding onto those items for me, and even backing them up in "the cloud" (actually my university-provided, passworded WebDAV space) without setting them free on the open Web in ways that would clearly and obviously violate copyright.

Now let us consider LOCKSS, and particularly the variant known as "Controlled LOCKSS" or CLOCKSS. Without wandering into the techie weeds, these programs do some elementary digital preservation on the e-journal literature by reproducing it widely in a geographically-distributed fashion, and coming up with policies for the opening of parts of the dark archive thus created in case of a crisis that removes the normal methods of access.

The thought occurs that Zotero, Mendeley, and similar bibliographic managers are a sort of do-it-yourself LOCKSS system. Metadata? Check. Digital items? Check. Reproduced widely in a series of dark archives? Check. All we're missing is the crisis-policy piece.
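The one piece a personal bibliographic manager doesn't hand you is fixity checking of those dark copies—LOCKSS audits its replicas; your Zotero folder just sits there. A do-it-yourself version is a few lines of Python (a sketch only; the folder layout is whatever your manager uses):

```python
# Fixity checking for a personal "dark archive": record a SHA-256
# checksum for every stored file so later runs can detect silent
# corruption or loss -- the same basic idea LOCKSS applies at scale.
import hashlib
from pathlib import Path

def checksum_tree(root: Path) -> dict:
    """Map each file's relative path to its SHA-256 hex digest."""
    sums = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = str(path.relative_to(root))
            sums[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return sums

def audit(manifest: dict, current: dict):
    """Compare a saved manifest against a fresh scan."""
    changed = [p for p, h in manifest.items()
               if p in current and current[p] != h]
    missing = [p for p in manifest if p not in current]
    return changed, missing
```

Run `checksum_tree` once, save the result, and rerun periodically; any file that `audit` flags as changed or missing has drifted since the last check.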

I don't wish a huge e-journal database loss on anyone, believe me; we would all be the poorer. I do feel just a tiny bit relieved, though, about this particular emergent effect of the widespread use of popular bibliographic managers.

This adds another soupçon of urgency to the movement for open data, to my mind. Open data will often (storage space willing) be reproduced by other researchers wishing to work with them. All by itself, organically, that phenomenon helps insure those data against loss, destruction, falsification, and other evils.

No responses yet

ETDs as the data-curation wedge?

Sep 11 2009 Published under Tactics

Many doctoral institutions now accept and archive (or are planning to accept and archive) theses and dissertations electronically. Virginia Tech pioneered this quite some time ago, and it has caught on slowly but steadily for reasons of cost, convenience, access, and necessity.

Necessity? Afraid so. Some theses and dissertations are honest digital artifacts, unable to be faithfully represented in ink on paper or in other analog fashion. Others might be flattened into analog, but that wouldn't be their (or their author's) preference. Still others contain digital artifacts of various sorts. Source code. Multimedia. Data.

ETDs don't pose any special digital-preservation challenges over and above the usual. (I got into an exchange on Twitter yesterday about a dissertation presented with a web content-management system, raising the issue of the artifact's sustainability given the CMS dependency. But any CMS with any content involves those same issues.) What they do present, given their popularity among faculty, students, administrators, and even (some) librarians, is an opportunity.

Institutions consider dissertations to be vital institutional history. (Master's theses—well, that varies from institution to institution, and even within institutions.) There can be no question of throwing away a dissertation simply because it's digital; an institution receiving digital dissertations has no choice but to do something about them.

Now, a lot of institutions, it seems to me, aren't doing much or are doing the wrong things. (If your institution has an unaudited pile of CD-ROMs, that's the wrong thing. Perfectly understandable given the circumstances, but still wrong in today's technology environment.) This shouldn't be surprising or terrifying, nor is it excuse to excoriate the institutions. We all do our best with what we have and what we know at the time.

However… the tools now exist for us to step up our digital-preservation game, and ETDs give us an unassailable, mission-critical reason to. Remember, the problems aren't specific to ETDs, so if we solve them for ETDs, we've solved them for a wide swathe of other kinds of documents and data as well.

Perhaps instead of spinning jargon-laden webs of words such as "cyberinfrastructure," we should start with an easy-to-recognize problem that we already know we have.

8 responses so far

Webcast in a week

Sep 11 2009 Published under Miscellanea

I wanted to call attention to this event at Harvard, which will be webcast live next Friday at 12:15 Central.

The difficulties in combining data and information from distributed sources, the multi-disciplinary nature of research and collaboration, and the need to move to present researchers with tooling that enable them to express what they want to do rather than how to do it highlight the need for an ecosystem of Semantic Computing technologies. Such technologies will further facilitate information sharing and discovery, will enable reasoning over information, and will allow us to start thinking about knowledge and how it can be handled by computers.

I have my… issues… with "semantic" computing. I also think that it'd be nice to figure out how we're going to manage all these data before we get all starry-eyed about what semantics we can pull out of them, but I'm funny that way.

Even so, I think this will be a worthwhile talk and I plan to tune in. Hope you'll do the same!

No responses yet
