Archive for: August, 2009

Tidbits, 14 August 2009

Aug 14 2009 Published under Tidbits

I am furloughed today and going out of town, so here, have an early tidbits post.

  • I won't be at the iPRES 2009 conference, but I do recommend looking over the program; it gives a pretty good overview of what digital preservationists think about and study, and what keeps them awake at night. (Midwesterners: the International Digital Curation Conference is coming to Chicago in 2010. I'll be there!)
  • The strength of weak ties: why Twitter matters to scholarly communication. Spot on, and true of FriendFeed as well. This is why, privacy concerns aside, the Facebook acquisition of FriendFeed is a threat; the friends-and-family design limits or eliminates casual elbow-rubbing.
  • Digital Library Services in the Information Arcade from the University of Iowa. This is an e-research service approach worth pondering. Rather than create a digital-curation or digital-humanities outfit from whole cloth, Iowa is adding consulting responsibilities (and additional services TBD, apparently) to an existing service brand whose former responsibilities have to some extent gone elsewhere. I'll be watching this, and I hope Iowa lets us know how it's going. Love the planning wiki, too.
  • Research data preservation and access: the views of researchers. Seems about right to me. Would researchers care to comment in the comments?
  • From SciBling John Wilbanks, Publishing science on the web. I haven't blogged yet about openness and e-research, but I will be, because e-research without openness is so much technology-enhanced window-dressing. Consider John's post a sneak peek at the sort of thing I think about.
  • Reading with Machines. Well-written discussion of where computers fit in textual scholarship, with which I entirely agree. Nice mini-bibliography at the end, too.

I have a few more links in the pipeline, but I think this'll do. Happy (furloughed or not) Friday!

No responses yet

Not turning up our noses

Aug 12 2009 Published under Tactics

I gave a talk for PALINET some little while ago about institutional repositories. The audience had been primed by the fantastic Peter Murray to think about looking after digital content as the "fourth great wave" of library work. (I wish that talk was online. It was absolutely brilliant.)

But not everyone was entirely on board with that. I recall distinctly one distinguished-looking white-haired gentleman raising his hand. "We in libraries," he said (paraphrase mine), "have historically been purveyors of quality information. Authoritative information. On what basis should we jeopardize that raison d'être for institutional repositories?"

Brave man, and he expressed well a resistance I've felt in my librarian colleagues near and far as long as I've been running IRs. Why do you collect that, they ask without asking. IRs set up alongside established digital-library programs suffer worse, the parvenu being simply too déclassé to mention in the same breath as library-blessed digital collections. The funny thing is, in a lot of these situations I suspect the digital library was resisted for a long time too; I suppose I can only shrug and be mildly pleased that IRs legitimize digital libraries by being the next target of scorn.

The thing is, if libraries are going to involve themselves in digital curation, we'll have to get over our yen for authority and finality. Even, dare I say it, quality.

Part of the reason for this is that in many fields, data-quality standards haven't been worked out yet. Cowboy data curators have to do their best and hope. Over time, this problem is likely to become less salient, which I expect will also lessen librarians' resistance to data curation—but I doubt the issue will ever go entirely away.

A related part of the reason is that data authority is a vexed question, and in most cases (it seems to me) the data will have to be collected and cared for well before the question of authority can be resolved. We just won't know what data are usefully authoritative until the researcher community has chewed them over a bit.

Part of the reason is that if we want decent-quality, well-described data, we just can't sit around until it's final. I've any number of war stories about stupid data that didn't have to be stupid; its collectors just didn't think through what they were doing until it was much, much too late. A librarian—any librarian!—could have asked the right questions and pointed to some of the right answers, but only if brought in early enough to get those librarianly insights into the data-gathering process.

Sometimes, for all our best efforts, we'll find a dataset that needed an intervention that it didn't get. Sometimes, we'll have to sigh and take it anyway. Irreplaceability is one cogent reason to do so.

I expect that many librarians will find this an unpalatable set of outlook changes. The only counter I have is that they are necessary outlook changes if we are to participate in this service cluster.

2 responses so far

Community and archival

Aug 11 2009 Published under Praxis

FriendFeed, now due to be absorbed into the Borg that is the Facebook empire, allowed me to lurk on the fringes of the scientific community Cameron Neylon mentions in his post on the takeover.

Insert all the usual clichés here: it was enormously valuable, I learned a lot, and I wouldn't have missed it for the world. My humanities training wouldn't normally gain me entrée into such a circle, and neither would my professional identity. Insofar as I have professional ambitions in scientific data management, every bit of acculturation I can come by is priceless.

That community wasn't the only one I participated in; it wasn't even the one I went to FriendFeed for. Much of the informal Library Society of the World took up residence in FriendFeed after a particularly painful series of Twitter fail-whales, and FriendFeed was pretty good to us. Finding a different community was a bonus!

The writing is on the wall for FriendFeed; it'll limp along for a bit and then be shut down. Sic transit communitas mundi.

I could try to add to Cameron's rundown of the technical features of FriendFeed that make it more attractive than Twitter, but I'll pass, actually; I'm sure others will do that. As for Facebook—fool me twice, shame on me! I don't trust those people as far as I could throw 'em. For Book-of-Trogool purposes, though, I'm interested in this debacle from the perspective of memory organizations, the archival perspective.

Cameron and others are asking FriendFeed to allow them to archive posts and comments there (note to historians: this link will probably rot when FriendFeed dies for good). There is some chest-beating about the value of the content there.

I want to draw a distinction between personal value, community value, and archival value. Items of considerable personal value may have limited or no community value. Items of considerable personal or community value may have limited archival value—archival space and attention are not infinite (and growing more finite by the day). Archival value is often hugely overestimated.

This seems like a truism, but consider how many people leave lengthy runs of National Geographic magazines at libraries because although they have no personal value to the donor, surely they must have archival value! (Note to my readers: please don't do that! Please. Have the guts to throw the things out. That's all the library is going to do with them.)

So where is FriendFeed on this scale? For me personally, the value of the content I have put there is so low that I'm not planning to archive it. (I have a somewhat laissez-faire attitude toward life-archiving anyway; I have no ambition to appear in history books.) Likewise, to me, the value of the community content. The community interaction has been hugely valuable to me, and I hope it can survive FriendFeed's demise, but the frozen remains of that interaction? Limited if any value (again, to me; I don't argue with Cameron's or anyone else's value perceptions).

If we are to estimate the archival value of FriendFeed interactions, I think we need to ask: how much research work is happening here that happens nowhere else and that can inform further research work? The second criterion is crucial. If it doesn't create additional knowledge, it's not worth archiving. Harsh… but archival space and attention are not infinite.

Sorry, sociologists and historians of science: I don't think FriendFeed makes the cut. A lot of social software doesn't, especially considering the difficulty of archiving it at all. Archival is not typically a desideratum of these systems (and I frankly maintain that Facebook's stickiness regarding personal information is one reason I left it after zeroing out my profile), so it takes real effort to save anything.

Blogs and wikis may well make the cut—not en masse, to be sure, but on an individual basis. I've argued before and doubtless will again that libraries need to look seriously at their faculty's blogs, hosted in institution-space or no. The same questions as above are important. If it helps, think of blogs as gray literature, much of which absolutely has archival value.

Geoffrey Bilder tweeted today "When people say 'persistent' or 'sustainable', what they often really mean is 'for as long as I am actively interested in it.'" I think that's absolutely right and absolutely important. Interest wanes. It is the archivist's job to make educated guesses about the needles-in-haystacks in which interest will not wane.

No responses yet

Unpacking "the cloud"

Aug 08 2009 Published under Praxis

I hear talk about "the cloud" as the solution to research data curation. Data will waft softly up into "the cloud," and "the cloud" will look after it and give it back on demand, and there will be unicorns and rainbows and rainbow-colored unicorns, and—well, you get the idea.

I think this is bosh. Balderdash. Bunkum. But I also think it's worth unpacking why this is a popular and recurring idea, because there's the germ of a service design in there.

"The cloud" means a lot of things to a lot of people, but for the sake of argument, let's call it "third-party data-storage services" such as Amazon's S3. S3 is not a solution for data curation. The service-level agreement amounts to "we can lose any of your data any time, and your only recourse might be a refund of what you paid us." For unique, irreplaceable data, this is beyond unacceptable. Think it can't happen? It already has.

As part of a well-managed storage and backup system, S3 might do. Might. But do you really want to design around its limitations?

However. Look up at the sky, if you're lucky enough to be near a window. I'm guessing you see either no clouds at all, or a lot of them. More than one, at any rate. How many skies contain just one cloud?

Cost questions aside, what is it that "the cloud" promises that people want? Could those of us interested in data build that?

"The cloud" promises to make data storage secure, safe, and above all easy. Yes, I think we can do this, and I think we should. Fedora, iRODS, pick your poison—but big disk, taken care of invisibly behind the scenes, with lots and lots of ways to get data in and out?

We can do this. We should.

4 responses so far

Equipment and data curation

Aug 07 2009 Published under Praxis

Monado of Science Notes commented on my irreplaceable-data post thusly:

It sounds as if the best thing to do in the short term is not throw away the old equipment. And to use the old equipment to copy digital media to newer forms... for which no one ever gets a budget, right?

It's such a great comment that I want to unpack it a bit. As we work out our data praxis, this kind of question is exactly what we have to confront.

My first question is simple: What equipment are we talking about here? Using what media?

Libraries are wearily familiar with this question in (mostly) analog terms. We have microfilm, microfiche, and other microforms, and for all their wonderful archival qualities, they're useless without the accompanying machines. We have sound recordings in everything from wax cylinder to vinyl to eight-track tape to cassette to digital—and in a few cases, we are stuck with analog media we can't actually use, largely in hope that someday, somehow, money and opportunity will turn up to get the priceless information into a form usable now.

(Thought data-death was purely a digital phenomenon? Goodness, no.)

But there's another axis to think about our equipment on: data production versus data retention. Our data production equipment may be as great a threat to the viability of the data produced as anything else. Instrument scientists, this means you. What is your instrument putting on your hard drive? Can anything besides your instrument and its bespoke software read it? If not, welcome to dusty data death.

This is just a specific instance of a general rule: for best performance, prefer open formats to proprietary, standardized and documented formats to the reverse, and popular formats to niche ones. Data persistence is a crapshoot. Load the dice.
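To make "load the dice" concrete, here's a minimal sketch. The readings, field names, and file names are all invented for illustration; the pattern (an open, ubiquitous container for the data, plus a plain sidecar file documenting units and provenance) is the point, not any particular instrument.

```python
import csv
import json

# Hypothetical instrument readings; these fields are invented for
# illustration, not taken from any real instrument's output.
readings = [
    {"sample_id": "A1", "wavelength_nm": 532, "intensity": 0.84},
    {"sample_id": "A2", "wavelength_nm": 532, "intensity": 0.79},
]

# Open, documented, popular: plain CSV with a header row...
with open("readings.csv", "w", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=["sample_id", "wavelength_nm", "intensity"])
    writer.writeheader()
    writer.writerows(readings)

# ...plus a sidecar file recording what the numbers mean, so the data
# stay interpretable after the bespoke software is long gone.
with open("readings.json", "w") as f:
    json.dump({"units": {"wavelength_nm": "nanometres",
                         "intensity": "arbitrary units"},
               "instrument": "hypothetical spectrometer"}, f, indent=2)
```

Any spreadsheet, statistics package, or ten-line script written twenty years from now can read that pair of files. The same cannot be said of a vendor's binary blob.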

Monado's question, though, was about data retention equipment. My answer to this is actually relatively simple. All physical media fail and/or become obsolete. Don't choose a physical medium based on its purported longevity; gold CDs are not a panacea! Pick a physical medium based on recoverability and ease of migration instead.

Recoverability first. To my mind, this has two parts: noticing problems and fixing them once they've been noticed. Gold CDs are horrible for noticing problems; to audit a collection of them, a human being has to sit down at a computer, pop each one in, and test it. Zip drives, Jaz drives, floppy drives, USB sticks—same problem. They're hard to audit, so nobody audits them, so they fail silently, so the data on them gets lost. And that's assuming the equipment to read them remains commercially available! (I got so burned by SyQuest… lost pretty much my entire undergraduate output. This is, I hasten to admit, no great loss to humanity, but it still hurts me personally.)
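This is also why spinning disk audits so much better than shelves of CDs: a fixity check over online storage can be scripted and scheduled instead of done by hand. A minimal sketch, assuming a stored manifest mapping each file's relative path to its SHA-256 checksum (the manifest layout here is my own invention for illustration):

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 so large files don't exhaust memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def audit(manifest: dict[str, str], root: Path) -> list[str]:
    """Compare stored checksums against files on disk; return the problems."""
    problems = []
    for relpath, expected in manifest.items():
        target = root / relpath
        if not target.exists():
            problems.append(f"MISSING {relpath}")
        elif sha256_of(target) != expected:
            problems.append(f"CORRUPT {relpath}")
    return problems
```

Run something like this from cron, alert a human when the returned list is non-empty, and silent failure stops being silent.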

Fixing a problem once it's been noticed is the province of a good backup system. Got one of those? I hope so.

Ease of migration should be fairly self-explanatory. The easier it is to move your data, the more likely you are to do it when need be. The easier (and less disruptive) it is to swap out a failing or obsolescing bit of your data infrastructure, the better. CDs and DVDs fail on this account, too; copying them is slow and requires a lot of human intervention.

The current state of the art is spinning disk (with all appropriate reliability measures) with a backup system. The backup medium has, to my understanding, typically been magneto-optical disk or tape (thank you, commenter Markk), but "the cloud" is emerging (somewhat to my personal dismay) as an alternative. (Why does the cloud dismay me? Because it's not making any reliability or sustainability claims yet. This may change—and anyway, what sky has only one cloud? The ideas behind the cloud are good; the implementation just needs work.)

In sum: the equipment you use matters a great deal to the longevity potential of your data. Choose wisely!

2 responses so far


Aug 06 2009 Published under Tactics

Unconnected incidents are making me ponder questions of sustainability. I don't have any answers, but I can at least unburden myself of some frustrations!

I learned from a colleague that arXiv is looking for a new funding model, as Cornell is wearying of picking up the entire tab. Various options are on the table, and I'm not competent to opine on their feasibility. I'm more interested in the larger question: how are we, we libraries and we researchers, organizing to shoulder the burden of electronic archives, especially open-access ones?

Historically, the answer has been "not effectively." I can name scads of dead digital projects without having to think hard, and I daresay you can too. This is no longer an acceptable answer, if indeed it ever was. I'm just a little bemused and worried about the models that seem to be emerging—again, especially for open-access archives. If Cornell can't underwrite arXiv, arguably the most successful preprint archive ever, what does that mean for disciplinary repositories generally? (See also the move of OAIster from the University of Michigan to OCLC.) What does it mean for library support of open access, to data as well as documents?

There's more in those questions than I can unpack in a single post, so suffice it to say I think librarianship's stance on this question is a bellwether. Are we tethered to the past or working for the future? Are we really memory organizations, or are we only memory organizations for print? Will we pay for human access to knowledge, or only institutional access?

So that was one thing. Here's another.

In the course of my regular work, I had occasion to look for a long-term home for an item originating outside my institution. This, you see, is one peril of running an institutional repository; the mission is strictly constrained to materials originating (in some fashion) inside the institution. No matter how amazing that item was, I can't make an easy case for accepting it, and I may not be able to make any case at all.

So, all right, there may be an appropriate disciplinary repository somewhere. I went looking (ROAR, OAD, and the Goog) and found two possibilities. One restricted depositors by geographic origin; I sent it back to my correspondent, as I didn't know the origin of the ultimate requester.

The other… well, the other is the reason I'm being cagey around identifying the requester and the discipline. The other appears to have been hacked up by two people in their spare time. The two people got a couple of publications out of the attempt—and then they abandoned the repository; it hasn't seen any action in well over a year.

I have no words for this irresponsibility that are printable. The word "repository" gets kicked around a lot—it's not my favorite word either—but responsibility and sustainability are, I believe, two concepts commonly associated with it. Whomping up a repository on a lark and then leaving it to die is a betrayal of trust. I don't approve. Worst of all, this so-called repository is essentially cybersquatting; no one else will take another stab at making a home for materials in this discipline while the repository is still (however marginally) extant.

This is nothing I haven't said before:

When organizations fail

When even scholars wanting to do the right thing and hand off their work to a responsible party cannot find anywhere to go, when enabling digital communication and the preservation of its results is an altruistic act in libraries instead of the bedrock of our mission, when worthy digital projects die because we in libraries do not notice and reach out to them, when we ourselves can't see our way clear to sustaining digital materials… we have a serious systemic problem.

6 responses so far

Help Louisville... and think about your data

Aug 05 2009 Published under Miscellanea, Praxis

Yesterday the city of Louisville suffered a freak thunderstorm that dumped half a foot of rain in an hour and a quarter. Their library has been devastated, to the tune of a million-plus dollars in damage.

As a proud member of The Library Society of the World (and I have the Cod of Ethics to prove it!), I ask anyone who is able to throw a few bucks their way. I trust Steve Lawson to do as he says he'll do.

The library's data center and systems office were on its ground floor. If you watch Greg Schwartz's Twitterstream you can keep up with the recovery efforts. For my purposes, though, I want you to think very hard for a moment about your data, keeping the Louisville Free Public Library's experience in mind.

  • Where are your data? Do you have them on your hard drive? Are they on a departmental or campus server? Where is that? What natural or manmade disasters is it vulnerable to?
  • Geographically-dispersed backups, do you have them?
  • If you're relying on a third-party service, has it promised you anything about reliability, or the ability to get your data back out?

These are basic, basic questions, folks. If you can't answer them appropriately for your most important data, call in an expert yesterday to get the problem fixed. (Nota bene: graduate students are not experts.)

11 responses so far

A something about me

Aug 02 2009 Published under Metablogging

I cringe. I've accepted an invitation to speak somewhere, and an email comes back asking me politely for a bio. Cringe. Every single time. It's downright Pavlovian.

I loathe, despise, abominate, and abhor writing professional bios.

However. There's a point to the exercise: situating myself in context, so that folk can decide whether I'm worth listening to in the first place, evaluate my expertise and my biases, and make an educated guess about what questions to ask me that I can actually answer. So now that I'm starting to settle down here in my new ScienceBlogs digs, it's time for (dramatic prairie-dog chord) the bio.

Fortunately, I've all the space I care to use, and no need to maintain a strictly buttoned-down speech style. I hope that will make this easier…

I majored in comparative literature (specifically, inter-arts studies; even more specifically than that, literature and music in the Western European Middle Ages) and Spanish with a linguistics minor at Indiana University–Bloomington, a wonderful place for a curious and moderately versatile undergraduate. While I was there, I worked as a campus computing-lab attendant (oh, the wonder of the Mac Centris that could play music CDs during the Tuesday graveyard shift in Lindley Hall! wait, am I dating myself here?) and was introduced by a wise and prescient mentor to the Perseus Project, the Thesaurus Musicarum Latinarum, and other early exemplars of what we now call the digital humanities.

"That's the future," he told me bluntly over coffee (his) and soda (mine), thumping the table with both hands as was his way. "Electronic projects are the future of the humanities." I was too young and awestruck to do other than nod dumbly, not quite understanding what I was hearing, but the memory lingers… and who is to say? Perhaps I wouldn't be doing what I do without that.

Only about eight months separated my graduation from Indiana and my start as a graduate student in the Department of Spanish and Portuguese at the University of Wisconsin at Madison. In hindsight, I should of course have waited. In hindsight… I should have done many things differently. Ah, well. Long story short: four and a half years later I left, shell-shocked and miserable, just short of taking Ph.D. comps.

I had, however, learned a few useful things in that otherwise wretched and fruitless interim. I learned some paleography and lexicography. I learned the basics of teaching in the college classroom. I learned manuscript transcription, though not in TEI. Importantly, I learned the amazing raw power that digitized text offers the linguist.

(Consider, for example, this tour-de-force dissertation from 1912. Give me digitized texts and a Python IDE, and I can do this work more accurately and completely in vastly less time, because the only labor I will be putting in is the intellectual labor of defining words. The computer is vastly better at collation than I can ever be.)
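The mechanical half of that 1912-style collation work, finding every occurrence of every word form, really is just a few lines of Python; deciding what counts as a word, and which forms belong to which lemma, remains the intellectual labor. A minimal sketch:

```python
import re
from collections import defaultdict


def concordance(text: str) -> dict[str, list[int]]:
    """Map each word form to the token positions where it occurs:
    the mechanical collation a philologist once did on index cards."""
    index = defaultdict(list)
    for position, match in enumerate(re.finditer(r"\w+", text.lower())):
        index[match.group()].append(position)
    return dict(index)
```

Point it at a digitized corpus and every attestation of every form is in hand instantly; the lexicographer's time goes entirely to interpreting them.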

After a few months' worth of temping, my experience with medieval-manuscript transcription landed me a steady job. (How many people can say that? I wonder.) For the princely sum of $9.53 an hour to start, I was to help route scholarly journal and book manuscripts through an SGML-based typesetting workflow.

I learned SGML and fell utterly in love with it, DTDs and all. I learned regular expressions and loved them (and yes, I do long for the XKCD regex shirt, why do you ask?). When I tried to learn Perl it didn't stick… but I learned Python and got along moderately well with it, as I do to this day.

With my usual ability to dogpaddle in waters considerably over my head, I somehow landed on the Open eBook Publication Structure working group during this time. That was a heady experience, debating the minutiae of XML namespaces and the practicalities of book-production workflows with engineers from Microsoft and Xerox PARC, as well as sharp entrepreneurs and brilliant markup wonks such as Allen Renear and Steve DeRose. Perhaps someone could exist in that rarefied environment and learn nothing… but that someone was not me.

All things come to their destined ends, however, and the ebook boomlet of the early 2000s was no exception. I landed with the Puerto Rico Census Project while I sorted out what to do next, learning a bit about Microsoft Access and Visual Basic that I've since more-or-less successfully scrubbed from my brain. Then came library school, and after that my current stint, some four years now, running various institutional repositories.

So, all right, what has all this to do with cyberinfrastructure and data curation?

The thread that runs through all my motley history is the preservation and availability of digital data for future uses. I loved SGML because, warts and all, it was an elegant, future-conscious representation of digital text. I'm fond of the business of keeping digital artifacts viable and usable, and I stubbornly and in the face of considerable evidence to the contrary believe it a growth industry.

There. That's my bio. Hope it's done its job.

No responses yet
