On preservation versus replication of research data

Sep 20 2010 · Praxis

I often see a cost argument against research-data preservation: if it's cheaper to replicate or regenerate the data than to preserve it, why preserve?

Here's my question: Cheaper for whom?

If we remain within the context of an individual lab, this question is a no-brainer: if it's cheaper to regenerate, regenerate. As we dip our toes into a more open data world, however, I should think the equation changes rather.

Is it still cheaper for two labs to have to regenerate these same data? Five labs? Twenty labs? How many of those labs will have to buy specialized equipment to create those data, equipment they wouldn't need if the data were shared by the first lab? How much staff time—worst-case, specialized staff time—will be eaten up in regenerating data?

There are certainly offsetting costs to consider: the cost of data discovery, the cost of cleaning up and describing data for sharing, the cost of whatever munging it takes to move data from one lab's context to another's, the magnified cost of any error on the part of the data-generating lab.
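To make the comparison concrete, here is a rough back-of-envelope sketch (mine, not from the post): weigh the one-time cost of curating and storing a dataset, plus a per-lab reuse cost, against every lab regenerating the data from scratch. All the functions and dollar figures below are hypothetical placeholders, there only to show how the break-even point shifts as the number of reusing labs grows.

```python
# Hypothetical cost comparison: preserve-and-share vs. regenerate-per-lab.
# Every number is a made-up placeholder, not real data.

def preservation_cost(n_labs, curation=5_000, storage_per_year=500,
                      years=10, reuse_munging=1_000):
    """One lab cleans, describes, and stores the data; each reusing lab
    pays only the cost of munging it into their own context."""
    return curation + storage_per_year * years + reuse_munging * n_labs

def regeneration_cost(n_labs, equipment=20_000, staff_time=8_000):
    """Every lab that needs the data recreates it independently,
    paying for equipment and specialized staff time each time."""
    return n_labs * (equipment + staff_time)

for n in (1, 2, 5, 20):
    print(f"{n:>2} labs: preserve ≈ {preservation_cost(n):>7,} "
          f"vs. regenerate ≈ {regeneration_cost(n):>8,}")
```

Under these invented numbers regeneration wins for a single lab and loses badly by twenty; the real question is where the crossover sits for a given kind of data.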

Still, my sense is that the discussion around cost has been just a bit simplistic… and is likely to become more complicated as data-sharing norms emerge.

5 responses so far

  • Elizabeth Brown says:

    Dorothea,
There was a saying I frequently heard as a scientist: There's always time to repeat an experiment. It was said sarcastically (as many things are in the science community) to point out that planning time was well spent, even necessary, to move the experiment forward more quickly. The goal, of course, is to get the experiment "right" the first time so that you don't waste time repeating anything.
    Unfortunately I still think we have many scientists who simply don't trust each other with regard to data - take a look at any of Jean-Claude's posts on this topic. Even the creator of ONS is telling people not to implicitly trust data they see in the literature. This would be interpreted by some as a need to replicate results.

    • Dorothea says:

      Hard to know whether a replication is successful without having the data available in the first place to compare against!

      No, seriously, I take your point, but I don't think it refutes mine. Assume a replication step -- after that, does it make more sense to regenerate over and over again, or use the existing vetted data?

      • Elizabeth Brown says:

        I agree - you need to have some duplication to verify results, and anything past that could be considered superfluous. I think the culture is so ingrained to duplicate that it's tough to step back from that.

  • Chris Rusbridge says:

    I think I remember Simon Coles of the Southampton eCrystals service* making points about the high cost of not keeping "dull" crystal structures or other chemical data. Apparently there are many reactions which look very promising, and so are investigated, but turn out not to be promising. So they are not published, which means they are investigated again... and again... and again.

* Or it could have been Peter Murray-Rust.

BTW there were also some interesting results from ChemSpider when I was following that. They began keeping records, with provenance, of chemical claims such as melting points. Part of their point was that these "factual" claims did indeed vary, but the repetition and convergence were valuable indicators.

    • ebrown says:

Indeed, Chris, Tony has been very busy lately showing the inaccuracy of property data for chemistry from multiple sources. Also he and other ONS advocates have mentioned the lack of timeliness in posting corrected data. So data curation is not as simple as it seems.