On open data and disruptive innovation

BioMed Central recently issued a draft statement on open data. The details aren't earthshaking; you can read them yourself if you care to.

What I'm interested in is whether this manoeuvre puts BMC in a good position to disrupt other journals, particularly those announcing that researchers with supplementary material can go climb trees.

I'll repeat myself:

See, one of the lesser-known bits of Christensen’s market-disruption pattern is that the disrupting force needs to start out by “competing against nonconsumption.” You can’t take on the incumbent on its own turf; the incumbent will eat your lunch and you for dessert. (What’s the lesson for institutional repositories here? Starting with peer-reviewed journal articles was a doomed strategy, that’s what. Those are the crown jewels. The incumbents own those.) You have to find something else to work with, something unused or underserved that the incumbents turn up their august noses at—a low-end market, a different raw material—establish a market beachhead there, and expand your beachhead over time.

I do think it significant that it should be an open-access publisher taking up the data gauntlet when toll-access publishers won't. There's an asymmetry there worth examining.

For a toll-access publisher, data is a cost center, pure and simple, and one that they can't make any additional money from. Why can't they? Because their business model is based on closing off access, and closed-but-nominally-published data is becoming more useless by the day. If a research dataset can't be found quickly and computed upon at will—someday this may be translatable to "if it's not part of the linked-data web"—there's no point to "publishing" it. So our toll-access publisher has three unappetizing choices: refusing the data (in which case researchers who value their data may go elsewhere), opening the data (which will lead to awkward questions about why the accompanying papers aren't open too), or taking on a significant new cost center just to keep up with the BioMed Centrals of this world.

(Yes, there are some flourishing industries based on closed access to data, granted. They're separate from journal publishing, and they presume a type or quantity of data that people will pay for. I don't think that's true of most research-generated datasets, so my argument should hold up. As usual, though, my crystal ball is cracked and hazy, so you follow any prophecies it generates at your own risk.)

The big author-fee-supported open-access publishers, in my estimation, are focused on gaining market share right now, where "market" is defined as "attracting publishable submissions." So doing something smart with data looks likely to turn into a competitive advantage for them, and if it costs them too much, well, they're already charging, so they just figure out how to adjust their fee structure, no big deal.

Will PLoS and Hindawi follow BMC's example? I don't know. They should. Will data become the wedge that allows open-access journals to disrupt their toll-access counterparts? I don't know, and it will likely be some time before that can be assessed. The likeliest outcome is the hoary boring old "it will differ by discipline." Time will tell whether BMC has bet on the right disciplines.

Friday foolery: classification and thesaurus follies

So I've talked a lot here about classification in libraries, why libraries do it and how it works.

Sometimes, though? Really, truly doesn't work all that well.

I've librariansplained thesauri too. At some point in very complicated thesauri, following the chain of broader/narrower terms can lead to patent absurdities. Ah, well. Human systems are imperfect.

Tidbits, 9 September 2010

Looks like most of the server disturbance is history. This is a good thing! We shall celebrate with tidbits.

The best way to bring something to my attention for a tidbits run is to leave a comment!

Institutional repositories and digital preservation

With all the pressing issues the open-access movement has to deal with, I honestly don't understand why we scrap over digital preservation. But scrap we darned well do, so I'll toss my two coppers in the pot.

They amount to this. Digital preservation is not a single thing one does or doesn't do; it's a whole constellation of things, some of which matter more than others. By and large, considering real-world threats instead of playing digital security theatre, institutional repositories do fairly well at digital preservation. They could (and would) do better if institutional-repository software integrated better with file-format analysis tools.

I have no patience for "it's about open access, not digital preservation!" arguments. There is no access, open or otherwise, without at least basic preservation steps. We can see this principle in action, even: the disappearance of DList (the US library and information science repository) and Mana'o (a disciplinary repository for anthropology) removed quite a bit of material from the public eye.

Likewise, I have no patience for thinking of digital preservation solely in terms of technology. DList and Mana'o are the biggest, most glaring examples of access failure in the repository realm. (We don't actually know whether they were full-on preservation failures; the content may still exist out of sight somewhere. Or it may not, in which case we indeed have a failure of preservation.) In both cases the failure had nothing to do with technology: it was organizational and business-model failure. Both DList and Mana'o started as single-person projects. Neither made adequate contingency plans for the obvious risks of letting repository survival depend on a single person. The single person ran into time and energy limits. Nobody picked up the slack. The repositories died. QED.

(Think it can't happen to you? Ask yourself what would have happened to arXiv when Ginsparg got tired of it if Cornell University Libraries hadn't white-knightly charged in. I think it would have died too, myself.)

So if the major observed risk to content preservation is failure of organizational support, IRs hold up pretty well. I've been quite caustic in my time about institutions' and libraries' failure to support IRs adequately (and sadly, I have another acerbic article brewing in the back of my head) but I will happily say that I've never seen or heard of an IR whose sponsors weren't aware that they were taking on a serious obligation to the content they collect. Score one—and it's a big one—for the humble IR.

Regarding technology-specific threats, most IRs are far from perfect, but they're a good deal better than nothing. DSpace IRs, for example, do checksums on everything they ingest, and those checksums can be regularly audited. Assuming halfway-decent backup behavior (and yes, this is an assumption), this reduces bitrot danger to near zero. File-format obsolescence is often remarked upon as a problem, and it is true that IR software does not do all it should with tools like JHOVE designed to evaluate file formats and point out problematic files. Frankly, though, I'm with David Rosenthal and Chris Rusbridge on this one: mass-market file formats such as most IRs contain rarely become completely unreadable. Information loss, sure (fonts and formulas particularly), but not often and not much.
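For the curious, here is roughly what a checksum audit amounts to. This is a minimal sketch of the general technique, not DSpace's actual code: the function names (sha256_of, record_checksums, audit) and the choice of SHA-256 are my own assumptions for illustration, and a real repository keeps its checksums in a database rather than an in-memory dictionary.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_checksums(root):
    """At ingest: map each file's path (relative to root) to its checksum."""
    root = Path(root)
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def audit(root, manifest):
    """Later: report files that have gone missing or changed since ingest."""
    root = Path(root)
    problems = {}
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file():
            problems[rel_path] = "missing"
        elif sha256_of(target) != expected:
            problems[rel_path] = "checksum mismatch"
    return problems
```

Run record_checksums once when content comes in, keep the result somewhere safe, and run audit on a schedule. An empty result means every recorded file still matches. Note that the audit only detects damage; repairing it is what the halfway-decent backups are for.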

IRs could also stand to do better at geographic replication of their contents… but once again, this is an organizational issue, not a technology one. It's been addressed in a few cases, so we know pretty well how to do it; our organizations just aren't stepping up yet. I think the DuraSpace cloud efforts have brought this question to the front burner, and I expect matters to improve within the next year or two.

Finally, an oft-forgotten part of the IR preservation strategy is the human beings behind IRs. By way of example, I've adopted a few websites into the IRs that I've run. Before they go in, I check internal links, I remove unnecessary Flash (practically all of it, that is) with extreme prejudice, and I clean up unnecessarily nasty HTML. I'm pretty confident those sites will do all right for quite a long time because of my interventions.

So can we stop arguing about digital preservation now, please? Plenty more productive arguments we could be having.

Penny wise, systemically foolish

On the Twitter-river this morning, a tidbit bobbed by about an academic library told to earn revenue by renting out its meeting spaces to campus constituencies. (I decline to name the library lest the tidbit-teller land in difficulties.)

Bluntly: the just-mentioned suggestion is stunningly myopic and does not heed the library's mission. For the record: among the mission-driven tasks of the library are the equalization of resources among its patrons and strong resistance to creating means tests for access. That goes as much for meeting space as for any other resource.

Kathleen Fitzpatrick commented, "The entire university system has become a demented revenue generation machine, with one branch being extorted by another and looking for ways to pass those charges on." I think that is just exactly right. What the accountants who often make this bizarrerie happen don't ever seem to consider (nor does anybody in authority ever lay it at their door) is the overhead cost, the friction created. Is the added effort of shuffling money around internally really creating more money, or anything else of value, for the institution? Whom does it block from the action? Wouldn't the whole system run better, cheaper, and more fairly with less money-shuffling friction?

This gives me to think seriously about some of the trends in open-access-space. Barbara Fister crystallized some of my internal disquiet in her discussion of markets versus missions, and I recommend that post (at the risk of self-aggrandizement; it links back here) as a complement to this one.

By way of example, consider the arXiv. In contrast to nearly all voluntary green open-access efforts, it's wildly successful. (In fact, I would say that inappropriate generalization from the arXiv community has been one of green open access's worst bêtes noires. For all the "physics envy" noise, faculty generally won't do things just because the physicists do.) It's a feather in Cornell's libraries' cap.

But prestige and success aren't enough to guarantee underwriting from its host institution. The arXiv is now asking for support from other libraries. To give credit where it's due, many other libraries are ponying up. (Gee, I don't see you on the list, Yale University Libraries. Got no physicists, mathematicians, or computer scientists? No, no, no, I know you're just open-access slackers… and I for one will continue to call you out on it until you stop slacking.)

To me, this is the library-space-rental problem writ large. It's undoubtedly cheaper overall if Cornell pays for arXiv and its peer libraries make analogous investments in complementary services. It costs real money in overhead to create the kind of "sustainability" envisioned by many experts, in which services like arXiv constantly have to run on the fundraising hamster wheel, and libraries have to budget not for one service that they offer, but for tiny bits of many services that they contribute to.

Moreover, I see more incentive for cost-containment in the closely-owned services model than in the hamster-wheel model. Let us postulate a Cornell University Libraries that suddenly discovers that it can bring in external money for the arXiv. Why contain costs? Why not just raise membership fees? Whereas if the money comes from inside, efficiency is a much greater consideration. To be sure, efficiency becomes service-starvation when taken too far, but from an overall-system perspective, isn't efficiency what we want?

Closely-owned services are an even harder question for smaller libraries than for research libraries, of course. Their consortia likely have to be part of the answer, and we may have to consider hybrid sustainability programs, in which small bugs pay into a fund that the big bugs with the staff to run programs can draw some support from.

Maybe I'm mad, maybe I'm naïve, but I have to hope—have to believe—we can find better paths, more system-aware and less myopic paths, to open-access sustainability than the everlasting hamster wheel of fundraising and friction-heavy money-shuffling. And I do still believe that part of the answer is assessing investment and calling out slackers (YALE).

Hosting issues: no comments for now

Faithful readers may have noticed that Scientopia's hosting provider pulled our plug again today. We're working on fixes, but in the meantime, we've been asked to shut off comments, which I have duly done. With luck, this restriction will be lifted within a week or so, and I do apologize for it.

You can always get hold of me at dorothea.salo at gmail. And while I'm mentioning that, I want to thank Lynn Yarmey of Stanford, John Doyle of the National Library of Medicine, and Dave Nichols of the University of Waikato with all my heart for cogent and useful comments on my data-management presentation via email and Twitter. I have incorporated many of those comments into the presentation; expect a revised version shortly!

