Archive for: November, 2009

Heisenberg's Uncertainty Checksum

Nov 10 2009 Published by under Miscellanea

So here's an interesting problem I ran into today. You have metadata in an XML file. You want the file to be self-describing and self-verifying, so you want to embed its checksum inside it. The problem is, you can't add the checksum to the XML file without changing the file's checksum!

Is there an XML verification tool not subject to this particular tail-chase? I don't know of one offhand.
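One common way out of the tail-chase, for what it's worth, is to define the checksum over the file with the checksum field held empty: blank the field, hash, fill it in, and have the verifier repeat the same blanking before comparing. Here's a minimal sketch in Python; the `<checksum>` element and the function names are my own illustrative assumptions, and I'm using a regular expression instead of a real XML parser purely for brevity:

```python
import hashlib
import re

# Assumed convention: the file carries a <checksum> element holding a
# hex SHA-256 digest; the digest is always computed with that element empty.
_CHECKSUM_RE = re.compile(r"<checksum>[0-9a-f]*</checksum>")


def embed_checksum(xml_text: str) -> str:
    """Fill the (assumed) <checksum> element with a digest of the rest."""
    # Normalize: blank the checksum element before hashing.
    blank = _CHECKSUM_RE.sub("<checksum></checksum>", xml_text)
    digest = hashlib.sha256(blank.encode("utf-8")).hexdigest()
    return blank.replace(
        "<checksum></checksum>", f"<checksum>{digest}</checksum>", 1
    )


def verify_checksum(xml_text: str) -> bool:
    """Re-blank the checksum element and compare digests."""
    m = re.search(r"<checksum>([0-9a-f]+)</checksum>", xml_text)
    if not m:
        return False
    blank = xml_text.replace(m.group(0), "<checksum></checksum>", 1)
    return hashlib.sha256(blank.encode("utf-8")).hexdigest() == m.group(1)
```

The trick is just that embedder and verifier agree on the same normalization, so the checksum never has to cover itself.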

7 responses so far

Collaborative domain-expertise development?

Nov 10 2009 Published by under Tactics

Libraries do collaborative collection development, through consortia and increasingly via direct institution-to-institution arrangements. Reference and instruction are collaborative endeavors—look at any social-networking service with lots of librarians and you'll see on-the-spot crowdsourced reference responses.

Perhaps this collaboration instinct will help libraries respond to the challenge of domain expertise for data curation. Do I need to know cheminformatics, or do I just need to buy a cheminformaticist conference potations until I secure her business card?

Formalizing expertise-sharing arrangements strikes me as rather difficult. Nobody wants to be the person everybody across the country calls with questions about ChemML; when would there be time to get any work done? Still, I would have thought that collaborative collection development had too many moving parts to be practical, and it's being done.

In any case… I have "develop network of domain experts" in the back of my head as a wise thing to do.

No responses yet

Tidbits, 9 November 2009

Nov 09 2009 Published by under Tidbits

Starting off the week with some juicy tidbits:

That should keep everyone out of trouble a while…

No responses yet

No, you can't have a pony

Nov 05 2009 Published by under Tactics

I read the RIN report on life-sciences data with interest, a little cynicism, and much appreciation for the grounded and sensible approach I have come to expect from British reports. If you're interested in data services, you should read this report too.

A warning to avoid preconceptions: If you pay too much attention to all the cyberinfrastructure and e-science hype, it's very easy to fall prey to the erroneous notion that most of science is crunching massive numbers via grid computing and throwing out terabytes of data per second.

It ain't so. It never was so. Will it be so in future? Not any time soon, I'm thinking.

The report-writers don't try to paper over their mistaken assumption (and much love to them for it): "There is much talk of ‘big science’, and our initial research design presumed that we would be studying large-scale formal collaborations. But we found that most research groups in the life sciences continue to operate on a relatively small scale, and we revised our plans accordingly."

Again, we don't have hard evidence about the relative numbers or weight of small science versus Big Science. If we plan for nothing but Big Science, however, we're making an enormous error in judgment.

There's a good bit of attitude mining in the report; I have little to add to it, so I will merely recommend that you read it. The lack of carrots for data-sharing is a deal-breaker, just as it was for self-archiving, and I agree with RIN that using sticks only will cause fairly serious backlash.

Skipping to the end of the report, then, we find out what researchers want by way of data-curation support. Namely, everything and the kitchen sink. At zero cost to them or their grant agencies, of course. I don't know why any other response would have been expected; it costs researchers nothing to say in a focus group that they want a pony, so why would they not say they want a pony?

At some point, someone will have to tell them they can't have a pony. I don't envy that person or agency one bit. Even so.

I believe that individuals and institutions planning data-curation services should take researchers' wants as expressed in this report with a generous dash of salt. No institution can give them what they want, because what they really want is for the problem to be taken care of without their involvement. The aim, instead, should be to give them what they need.

No responses yet

Stepping away from the shiny

Nov 04 2009 Published by under Praxis

There is a certain kind of digital project that strikes terror and dismay into the hearts of digital preservationists everywhere. Not a one of us hasn't seen many exemplars. They make me, for one, feel sad and tired.

They're projects that, no matter their scholarly or design merit, are completely unpreservable because they were built from unsustainable tools, techniques, and materials. What's worse, even a cursory examination with an eye to sustainability would have at least signaled a problem.

It's not the unpreservability so much. It's the obliviousness that makes me hurt inside.

For various reasons, the digital humanities are particularly prone to this sort of thing. Scientists do use unsustainable tools, but often they haven't a choice (thank you for the lock-in, instrument manufacturers) and most times they're at least aware of the problem.

Humanists, on the other hand, will pick up whatever tool seems good to them without even asking themselves whether the result will last past the lifespan of the tool. Then they bring the resulting binary CD-ROM or Flash-based website or whatever to the library with beaming smiles, and are shocked to find out that the library can't help them.

Proprietary tools and formats are often quite shiny. I remember HyperCard well, and so may you. In its day, there wasn't anything shinier. The problem is, following the shiny to the exclusion of all other considerations dooms a shiny project to be less shiny a year later, hardly shiny at all five years later, and completely inaccessible and unusable five years after that.

(I do not kid. Historians and sociologists of early digital culture are deeply distressed at how much "HyperCard art" nearly fell out of reach forever, though there are now emulators capable of dealing with much of it.)

There are better ways to proceed. They may well be less shiny at first, but the secret is that shiny can almost always be added to solid sustainable data later on, through mashups or interface redesign or whatever takes your fancy. Once its platform is thoroughly obsolete, though, a project may well not be rescuable in any form. Worse yet, piling otherwise-sustainable raw materials into an unsustainable platform destroys the sustainability of those raw materials, too. I've seen it happen!

So please, step away from the shiny and think.

(Thanks to @pseudonymTrevor and other Twitter friends for inspiring this post, and possibly other ones—I am still pondering the intersection of "never done"-ness and sustainability.)

One response so far

Making standards that work

Nov 02 2009 Published by under Praxis

One phenomenon that will be—indeed, already is—utterly unavoidable in the data-curation space is the creation of standards. I once heard Andrew Pace say that standards are like toothbrushes: everybody thinks they're great, but nobody wants to use anybody else's.

Be that as it may, standards development and compliance is one way to make everybody's data play nicely with everybody else's data. It's not the only way, to be sure; one very important way that I'm sure we'll also see more of is Being The Only Game In Town. ICPSR manages this quite successfully, and so does the Digital Sky Survey. If you want to be important in the data spaces dominated by either of these large players, you play by their rules; it's just that simple.

When there's no big player to lay down the law, though, standards development becomes more attractive. How do you make a standard, then? More to the point, how do you make a good standard, a standard that works, a usable standard, a standard that will last?

I liked this blog post by Adam Bosworth about standards development very much. I think it captures much of the excellence that goes into successful standards as well as the dysfunction attending failed ones. I do want to add a fillip of my own, though, based on my own experience helping to build standards and trying to use standards built by other people.

When you're in a roomful of people tasked with building a standard, make sure the room contains representation from every group of people who will be asked or required to use it. That emphatically includes the non-technical and the non-specialist. It goes double or triple if the standard will affect existing technology installations: you must have someone in that standards room who uses the existing technology! No, a developer of the existing technology does not fulfill this requirement, because the distance between developers' understanding and users' understanding is often vast.

If the non-technical, non-specialist representative in the room can't understand the standard, it will fail. If that representative can't produce data that fit the standard, likewise. I agree with Bosworth's reservations about RDF; I myself have trouble understanding it and putting it to use, despite a decade's experience with markup, and I believe the tribulations such folk as I face when trying to deal with it have retarded its adoption significantly.

When this rule about representation is flouted but a standard is published anyway, the result is a standard that falls apart under real-world use. I will adduce OAI-PMH as an example. It follows quite a few of Bosworth's recommendations: it's simple (I have explained it in twenty minutes to library-school students), largely human-readable, focused, precise about encodings, in possession of real implementations, and free on the web.

It is also flawed. Huge projects built on it have found its flaws impossible to bypass and expensive to work around (see Lagoze et al. 2006 for how NSDL ran aground on OAI-PMH's inadequacies).

The major flaw, to my mind, isn't difficult to explain or to understand: OAI-PMH has no error-reporting built in. In a protocol standard built for communication of and about metadata, nobody in the standards-design process ever seems to have asked the (to me) simple and obvious question, "What happens if the metadata is malformed or otherwise wrong?"
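To make the gap concrete: OAI-PMH does define error codes (badArgument, noRecordsMatch, and so on), but they cover protocol misuse, not metadata quality. A record with empty or broken metadata comes back inside a perfectly legal response, so the harvester has to notice on its own. Here's a rough sketch of that defensive check using only the Python standard library; the namespace URIs are the real OAI-PMH and Dublin Core ones, but the validation rules (metadata must be present and carry a dc:title) and the function name are my own illustrative assumptions:

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"


def check_records(list_records_xml: str):
    """Partition harvested records into usable and problematic ones.

    The protocol itself will never flag these: a malformed record
    arrives wrapped in a valid ListRecords response, so quality
    checking is entirely the harvester's problem.
    """
    root = ET.fromstring(list_records_xml)
    good, bad = [], []
    for rec in root.iter(OAI + "record"):
        ident = rec.findtext(f"{OAI}header/{OAI}identifier", default="(no id)")
        md = rec.find(OAI + "metadata")
        if md is None or len(md) == 0:
            bad.append((ident, "missing or empty <metadata>"))
        elif md.find(f".//{DC}title") is None:
            bad.append((ident, "no dc:title"))
        else:
            good.append(ident)
    return good, bad
```

Every harvester ends up writing some version of this by hand, which is exactly the cost of leaving error reporting out of the protocol.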

Anyone who's worked on the ground with repositories of any stripe knows that metadata problems, sometimes gross problems, are par for the course. For that matter, any librarian can explain the pitfalls of metadata and citation creation at great length. I honestly can't tell you why OAI doesn't seem to have on-the-ground repository managers and other librarians capable of raising such practical issues working on its standards bodies.

I can, however, tell you that they should. The latest OAI development, OAI-ORE, contains exactly the same no-error-reporting weakness I just pointed out in OAI-PMH. Yes, some of the underlying technologies OAI-ORE is built on include certain kinds of error reporting, but the errors they can report cover only a subset of the errors I believe will crop up.

To make standards that work, include people on the standard-design team who work with the processes underlying the standard. Now that you know this—go forth and standardize!

2 responses so far
