Tidbits, 29 July 2009

All of today's tidbits are from one blog! Well, all but one.

  • David Rosenthal on digital preservation. I had this bookmarked to blog about, but…
  • Chris Rusbridge beat me to it, saying everything I would have. Yes, online-versus-offline. Yes, research data in uncommon, niche, and/or proprietary formats. Yes, metadata! And yes, thinking for ourselves.
  • Semantic Web of Linked Data for Research? In all honesty, my reaction to "Linked Data" can be summed up in Chris's question mark. I am not a fan of RDF, and I remain to be convinced that even small, constrained Semantic Webs are feasible, given how slippery human reality-representations are and how fraught the attempt to render them in computer-understandable terms is. Chris makes me reconsider, though.
  • My backup rant. No, not mine—Chris's again. But I have the same rant! I would add to it that I have heard many graduate students mourn that their labs push backup chores onto them without the least effort to provision them with appropriate technology. Those labs that think about backups at all, that is…

Chris, I haven't gotten around to reading the latest International Journal of Digital Curation yet; it's sneering at me from Bloglines. I will get to it, though!

Dividing up the pie

Another thing I meant to call out in the context of the Jupiter-goes-boom event was the nod to data gathered by people who aren't connected to the formal research enterprise save tangentially.

This event was first noted by someone not an astronomer by profession, and the article notes that this is hardly the first time astronomers have been scooped. My husband, who is an extremely amateur skygazer and likes to hang out on online astronomy bulletin boards, says that his impression is that astronomers mingle with enthusiasts fairly freely, all things considered, and both sides appear to benefit.

Astronomy isn't the only field where this happens, of course. The Center for History and New Media projects I mentioned in my previous post are essentially crowdsourced news-gathering turned into history. When I was a graduate student in linguistics back in the day, I had occasion to look at Mayan, which amateurs have been instrumental in deciphering. Birdwatchers no more skilled than I are of material help to ornithologists in providing localized bird counts and similar observations. I am also seeing some renewed excitement about "crowdsourcing" various scientific tasks that can't be done by computers but are too laborious and time-consuming to assign to researchers.

So my question about all this is… who's looking after their data? Do data have to come from an accredited scientist affiliated with an institution before they are worth preserving?

Sometimes these questions have answers. Sometimes, not so much.

This points to a larger question, an elephant-in-the-room question. Whose responsibility is all this data gathering and preservation, anyway? "Individual researchers" is an inadequate cop-out, let's just get that on the table right now; without sustainable support, data die when grants fade or retirements happen.

This leaves a few possibilities: funders (notably government), disciplines, and institutions. None of them is unproblematic—in fact, I would go so far as to say that none of them can solve this problem unaided.

Relying on funders assumes that funders will take a long-term perspective on sustainability. Funders can be fickle about this, even government funders; witness the troubled trajectories of the ERIC education database in the US and the Arts and Humanities Data Service in the UK. Worse, outside government vanishingly few funders have resources and infrastructure to throw at this problem; the most they can do is throw money at it in the form of grants, which is not a sustainable funding model by any means.

The line between disciplines and institutions is often a fuzzy one, honestly. The arXiv is the paradigmatic disciplinary preprint repository—but it is sustained by the Cornell University Libraries. Things were not always thus, but such a handoff isn't exactly unusual.

However. When you ask a researcher about her "discipline," she'll probably start talking about her favorite scholarly society. Where are the scholarly societies in all this ferment about data? Gosh, wish I knew. We'll just pass by the American Chemical Society in silence, shall we? They're an outlier and we should all be glad of that… but where's everybody else? Looking for services that members need? Materials that keep members coming back to the society? Why aren't scholarly societies in the data business? I wonder.

Institutions. Institutions have a built-in challenge dealing with data: they have to deal with it over a wide swathe of disciplines. I can't emphasize enough how hard that is! Different formats, different metadata standards (where there are any at all), different ontologies, different patterns of thought, different workflows… there's just no end to the differences.

In these early days, I see a few different institutional approaches to this problem. One is "follow the money." If you've got million-dollar grants, you'll get red-carpet treatment. No grants? No service. When this model is accused of inequity, it throws its hands up and says "since when was life fair?" Another approach is what I call "help the First Son." In the Pesach parable, the first son is the one who approaches his father asking detailed and intelligent questions about Pesach observance, and receives detailed and intelligent answers.

I don't know about you, but I don't know many First Sons among researchers. A few, yes, but not many. A lot of the researchers I know are Third Sons. "What is this?" they say. And a lot are Fourth Sons, who do not even know how to ask. A First-Son approach leaves our Third and Fourth Sons with no answers.

So what we're left with, when we ask who's responsible for data, is a big muddle. Some disciplines have this pretty much sorted. For them, institutional support may be redundant. Other disciplines are under the funder gun; it's still unclear what the institutional role will be there. Many researchers fall into neither group; either their institution helps them or they get no help.

My worry is that as the pie is currently divided, a lot of researchers aren't getting any.

Day in the life of an institutional(ized) repository librarian

There's a string of "day in the life" librarian posts happening, so I thought I'd throw one in. Today wasn't a typical day, I suppose… but I don't really have typical days, especially these days.

6:00-ish am: Wake up, kick the cat off the bed accidentally, get out of bed.

6:20 am: Dressed and etceteraed, sit down with laptop to check out the daily news and a few webcomics. (What? It's my routine. It works for me.)

6:50 am: Feed cats before they kill each other. Or me.

6:55 am: Pick up bag and leave house to walk to work.

7:28-ish am: Arrive at work. Show early-admission pass. Trundle up the stairs and back to the half-an-office.

7:30-8:00 am: Triage email, play voicemail (I was out two days last week, so there was a shocking lot of it), check professional-type blogs and Twitter, make notes on to-do list.

8:00-9:15 am: Answer email as needed. Send out agenda for Thursday meeting, because owing to a local conference tomorrow and Wednesday I'll forget if I don't do it now. Read two documents (one short, one lengthy) on which I will be expected to have an opinion later in the day.

9:15-9:25 am: Make a phone call promised to a faculty member last week while I was out. Arrange for deposit of software source code into the repository. Assure faculty member that updates will be possible (no, the software isn't happy about that, but the software is wrong and the faculty member is right, so I'll cudgel the software into submission). Make mental note to add versioning to the tech wiki's page on new-repository requirements.

9:25-10:00 am: Hack at DSpace's XML configuration for web forms, so that I can roll an update into production later this week without undue disruption. (I thought I had this done, but I had to rethink myself; I want to move everybody to a new thesis form, but on sober reflection it's not a good idea to do that all at once.) Intermittently, plead via Twitter for repository deposits so that I can reach the next hundred-item milestone, which I'm fewer than ten items away from.

10:00-11:10 am: Work on the hemi-demi-semi-official e-research blog, embedding the various video bits we have of campus researchers talking about data and writing intentionally minimalist lead-in text.

11:10-11:30 am: Walk over to computer science building for meeting, dodging unbelievable amounts of road and building construction.

11:30-12:30 pm: Discuss first document read earlier, a draft of a proposal for the human element of data curation, destined for the campus IT strategic-planning process.

12:30-12:50 pm: Walk back to half-office.

12:50-1:25 pm: Triage more email. Check blogs and Twitter. Check the repository's item count. 8401, three cheers!

1:25-1:30 pm: Stroll over to School of Library and Information Studies library with colleague for meeting.

1:30-2:40 pm: Hold productive discussion of how SLIS student projects can lead to more library digital collections.

2:40-2:50 pm: Talk with SLIS faculty member on the way back to the library.

2:50-2:57 pm: Hastily triage email, grab fat yellow folder for next meeting.

2:57-2:59 pm: Scoot upstairs as quickly as possible; it's rude to be late!

3:00-3:40 pm: Discuss second document, a subcommittee report.

3:40-4:15-ish pm: Walk home.

4:15-4:45 pm: Pull up tomorrow's presentation (for the aforementioned two-day local conference) for last-minute retouches. (8401!)

And thus is a day made.

Tidbits, 25 July 2009

Interesting and perhaps relevant:

  • Jean-Claude Guédon's examination of power in science. Does e-research destabilize this situation? How? If it doesn't, should it?
  • Should copyright in academic works be abolished? Makes the obvious point that journal-article authors don't use copyright for its intended purpose of filthy lucre, and extrapolates from there. What I notice is that journal-article authors use copyright as a bulwark against plagiarism, lack of credit, and (whatever they perceive to be) misappropriation. Copyright is a lousy tool for that. We need better ones. Personally, I'd prefer that they bypass the legal system altogether.
  • A nifty-looking modeling tool: Emergent Trails for brain-process modeling.
  • A workflow tool: VisTrails. (Am I the only person who gets an Ars Magica frisson from that name? I probably am. I am such a nerd.) Tracks who did what to what when, with what result. Pluggable, written in Python.

Have a pleasant weekend!

XML and cows

Because I've seen it quoted, misquoted, and usually not attributed at all… "Converting PDF to XML is a bit like converting hamburgers into cows." That is the quote I know of. It comes from revered XML developer Michael Kay on the xml-dev mailing list in July 2006.

It's possible Kay got this from somewhere else, but I've never seen an earlier attribution. (Comments are open if I'm wrong.)

I hear all sorts of chest-beating about attribution in data circles, often for good and sufficient reason. I think we can stand to get our quotes and their authors right.

Irreplaceable data

Published Jul 22 2009 under Tactics

And we're back! (With a four-note theme. Wait, that's Peter Schickele on Beethoven. Never mind.)

So yesterday before our enforced break, I asked what we could learn about e-research from a big chunk of space flotsam hitting Jupiter. What had caught my eye was this passage:

… the planetary astronomy community has been filled with excitement—emails are flying, with people exchanging information about the new discovery and its development. Major observatories are canceling their scheduled observations so that they can point their telescopes at Jupiter.

Why are they doing this? Because this is the only chance they get to record data about this particular event. Once it's over, it's over. And once it's over, any data that have been recorded are irreplaceable if lost, destroyed, or otherwise rendered unusable.

Irreplaceable. Scary word. Puts data curation in a new light, doesn't it?

If you work in a field that is not reliant on transient observational data and in which experiments are easily replicable, you are one seriously lucky duck. The rest of us get one shot at what we study, because we're stepping into Heraclitus's river every single day of our research lives.

Don't think this phenomenon is limited to the astronomers and the climatologists. Consider the plight of the linguist recording the last native speakers of a moribund language. Consider the historian or sociologist, or ecologist, or… anyway, trust me, it's widespread.

Some corollaries fall out of the irreplaceability axiom. On a walk around the block during this summer's Arts and Humanities Data Curation Institute, I was (perhaps dubiously) inspired to create the image following, patterned on Maslow's famous hierarchy of needs:

[Image: Salo's Pyramid, a hierarchy of data-curation needs with data acquisition at the base]

Irreplaceability is the reason I put data-acquisition issues at the bottom of the pyramid. If you ain't got the data in your grimy little hands, none of the rest of the pyramid matters!

This is the chief reason I think institutional repositories as a whole have been (pace Cliff Lynch) a failure thus far. They absolutely reek at getting their grimy hands on data, irreplaceable or otherwise. One may sneer at how such outfits as the Center for History and New Media fare on some of the upper strata of the pyramid, and I have in fact done so (privately heretofore, but oh well; Dan Cohen knows I love him), but there is just no denying that CHNM knows how to get its hands on one-time data.

Another corollary: when we are prioritizing what data we curate, since we simply cannot keep it all, irreplaceable data have a leg up on the competition. I believe in some areas of chemistry (and perhaps elsewhere), some rather heated arguments are taking place about whether to keep or recreate data. Looking at the heinous volume of irreplaceable data, I think I have to fall on the "recreate whenever possible" sword, recognizing that it is a sword.

And one last corollary: researchers who gather irreplaceable data have a special obligation to take good care of it!

Salo's Pyramid, by-the-bye, is finding use elsewhere. No one is so surprised by this as I, since it was a spur-of-the-moment thing (I'd just put Maslow's dissertation in the repository, and… look, my brain is a strange and uncanny place, okay?), but for the record, that entire presentation is licensed CC-BY. Gank in good health.

Back in a tick

Published Jul 21 2009 under Metablogging

I am reliably informed that there will be a server upgrade going on tonight, so ScienceBlogs will be down for the count until it is complete.

While I'm gone, have a look at the goings-on around Jupiter, and think about what that means in an e-research context. I'll be back with my thoughts!

Tidbits, 21 July 2009

Have a high-bandwidth day!

Review: Borgman, Scholarship in the Digital Age

Borgman, Christine L. Scholarship in the Digital Age: Information, Infrastructure, and the Internet. MIT Press, 2007. Worldcat page, Powell's page (no, I get no kickback).

This calm, clear volume provides a thorough grounding in the practices of academic researchers around their publications and their data, and how the Internet is—and in many cases, isn't—changing those practices.

Copiously researched, accurate, and logically presented, the book starts with a 30,000-foot overview of the current situation, then swoops through technology, law and policy, the existing scholarly-communication system, and the issues and opportunities associated with research data, before tying everything together in a cautious view of possible futures.

This is not a book you read for polemic, nor for original insight; those are not its purpose. (I couldn't tell where Dr. Borgman comes down on many politics and praxis questions dear to my heart. Good for Dr. Borgman.) This is the book you read to figure out where we are, how we got here, and where we might be going, much as you might pick up a review article in the Annual Reviews series. As a quick, objective, comprehensive grounding in 21st-century scholarly communication, I have trouble imagining a better book.

If I have a criticism, and frankly I found it quite difficult to come up with one, it's that Dr. Borgman doesn't convey much of a sense of urgency around the issues she discusses. If you're a librarian or administrator trying to figure out how to apply scarce resources to this constellation of problems, this book will introduce you to the vastness of the landscape, but it won't point you to the low-hanging or scarily-perishable fruit.

Researchers: scholarly communication is the air you breathe, so you owe it to yourselves to read this book for orientation. Librarians and research IT folk: if you're feeling lost in all the ferment, this book will give you confidence. Highly recommended.

Many thanks to my colleague Jim Muehlenberg for letting me borrow his copy as reading for the bus ride to and from ALA. I will be purchasing my own for my bookshelf at work, where I keep books I think people who visit me may need to look at.

Evolution or revolution

Lively welcome here at ScienceBlogs, I must say. Two posts, a soft launch, and eighteen comments already!

The comments have turned up a question deserving of further discussion. On my first post, commenter Jim Lund said:

E-research? Why make a distinction? Today there's only e-research and archaeology. 🙂

And on my second, commenter rnb said:

Computers have been used to investigate circuit behavior since I was in college back in the 70s. So should engineers be called e-engineers?

Not trying to put words in their mouths here, but it seems to me they're getting at the same question about how we talk about e-research: evolution or revolution?

I've done both, myself. I talk in evolutionary terms with my librarian colleagues, because librarians are frankly weary of revolution-talk. It just works better to talk in terms of what we already do. You can see me trying to keep things low-key and jocular in this slideshow I did for the University of Wisconsin at Milwaukee:

When do I go all revolutionary? When I'm talking to those who hold the purse-strings. Even if e-research is the normal course of things in a few disciplines, in most it's not. This means that resource provisioning hasn't happened yet in a lot of cases—and a message of "we're already doing this and just haven't realized it!" won't get dollars and staff time allocated.

(In fact, I think some aspects of e-research, notably data curation, are dangerously underprovisioned, and we'll pay for it later… but that's for another post.)

But I don't get as revolutionary as some folks do. The Digital Humanities Manifesto, while I know and respect some of its authors—honestly, it reads to me like concentrated wacky sauce. I can't imagine it convincing an old-school historian or literary scholar, much less a dean or provost, that the digital humanities are a pursuit worthy of prestige and (more compellingly) resources.

I have a feeling—and it's no more than that; I don't know—that the Manifesto comes out of a place of deep frustration. Believe me, I understand frustration, having run institutional repositories for over four years. The most unwise things I do happen when frustration boils over. On that scale, the Manifesto barely rates. It's just bravado, not especially damaging… I hope.

The moral of the story (and I say this as shouldn't) is that we e-research types do need to think about how we present ourselves and our endeavors. We may not choose the same words; we may not even be consistent about the words we choose and use. Eyes on the prize, however. Some words will help us more than others.

What words do you use? To whom? Why?

