Irreplaceable data

And we're back! (With a four-note theme. Wait, that's Peter Schickele on Beethoven. Never mind.)

So yesterday before our enforced break, I asked what we could learn about e-research from a big chunk of space flotsam hitting Jupiter. What had caught my eye was this passage:

… the planetary astronomy community has been filled with excitement—emails are flying, with people exchanging information about the new discovery and its development. Major observatories are canceling their scheduled observations so that they can point their telescopes at Jupiter.

Why are they doing this? Because this is the only chance they get to record data about this particular event. Once it's over, it's over. And once it's over, any data that have been recorded are irreplaceable when lost, destroyed or otherwise rendered unusable.

Irreplaceable. Scary word. Puts data curation in a new light, doesn't it?

If you work in a field that is not reliant on transient observational data and in which experiments are easily replicable, you are one seriously lucky duck. The rest of us get one shot at what we study, because we're stepping into Heraclitus's river every single day of our research lives.

Don't think this phenomenon is limited to the astronomers and the climatologists. Consider the plight of the linguist recording the last native speakers of a moribund language. Consider the historian, or sociologist, or ecologist, or… anyway, trust me, it's widespread.

Some corollaries fall out of the irreplaceability axiom. On a walk around the block during this summer's Arts and Humanities Data Curation Institute, I was (perhaps dubiously) inspired to create the image following, patterned on Maslow's famous hierarchy of needs:


Irreplaceability is the reason I put data-acquisition issues at the bottom of the pyramid. If you ain't got the data in your grimy little hands, none of the rest of the pyramid matters!

This is the chief reason I think institutional repositories as a whole have been (pace Cliff Lynch) a failure thus far. They absolutely reek at getting their grimy hands on data, irreplaceable or otherwise. One may sneer at how such outfits as the Center for History and New Media fare on some of the upper strata of the pyramid, and I have in fact done so (privately heretofore, but oh well; Dan Cohen knows I love him), but there is just no denying that CHNM knows how to get its hands on one-time data.

Another corollary: when we are prioritizing what data we curate, since we simply cannot keep it all, irreplaceable data have a leg up on the competition. I believe in some areas of chemistry (and perhaps elsewhere), some rather heated arguments are taking place about whether to keep or recreate data. Looking at the heinous volume of irreplaceable data, I think I have to fall on the "recreate whenever possible" sword, recognizing that it is a sword.

And one last corollary: researchers who gather irreplaceable data have a special obligation to take good care of them!

Salo's Pyramid, by-the-bye, is finding use elsewhere. No one is so surprised by this as I, since it was a spur-of-the-moment thing (I'd just put Maslow's dissertation in the repository, and… look, my brain is a strange and uncanny place, okay?), but for the record, that entire presentation is licensed CC-BY. Gank in good health.

  • eddie says:

    Hi Dorothea,
    While reading this post I have been listening to a Bad Company DJ set, put together on the hoof for the Breezeblock radio program a number of years ago. It's 44 minutes of awe-inspiring drum'n'bass and one of my most cherished memories.
    The original recording was on C90 cassette but since then it, along with the rest of my record collection, has been backed up to CD-ROM and is now on multiple hard drives (on a comp that has no web access). I even have compressed versions so my phone can play them too.
    I know it's nerdy but I heard about "multiple copies in multiple formats" and took it to heart. My collection has over 14,000 tunes and many are irreplaceable. Needless to say I still have the old tapes and vinyl as well.
    In terms of your pyramid, the more immediately accessible formats are more vulnerable to sudden loss but, even when never played, the tapes, vinyl and even CDs are ageing.

  • An extremely valuable point, eddie; thank you for making it.
    All media, including analog media, require conservation to survive. Conservation of analog media tends to be less salient in people's minds, especially in the academic context; it's simply assumed that "the libraries take care of that." There's an entire preservation infrastructure underneath analog media that is presumed to be a fact of nature rather than a human endeavor.
    So because that preservation infrastructure doesn't yet exist for digital materials, people who don't see the analog infrastructure assume that digital preservation is an impossible thing.
    Whereas the same sort of thinking goes into both sorts of preservation, really.

  • Monado says:

    It sounds as if the best thing to do in the short term is not throw away the old equipment. And to use the old equipment to copy digital media to newer forms... for which no one ever gets a budget, right?

  • Something like that!
    That's actually such a good comment that I need an entire blog post to address it. Give me a day or three, okay?