Archive for November 2009

Another reason data services need librarians

Nov 30 2009 Published under Tactics

Some people watch football over Thanksgiving weekend; I get into discussions of disciplinary data regimes with fellow SciBling Christina and others on FriendFeed. Judge me if you must!

Another common truism in both the repository and data-management fields is that disciplinary affiliation accounts for a lot of the variation in observed researcher behavior. For once, I have no quarrel with the truism; it is unassailably the case. The wise data curator, then, knows some things about disciplinary practices going in.

But what things, exactly?

I don't believe that taxonomy exists yet; it'd be an awfully fruitful thing to research. Based on learning from Christina about chemistry and reading quite a few reports of late (including the one she references), here's my wild stab at a beginning:

  • Is research in this field collaborative or lone-wolf? In astronomy, everybody relies on everybody else's data; as data tend to be so expensive to collect and store, there's hardly any other way to work. In chemistry, lone-wolf labs tend to clutch data close.
  • What are the ties with industry like in this field? The closer those ties, the likelier researchers are to muzzle data, fearing that a scoop will mean monetary as well as career damage.
  • What are the ethical and legal constraints on data sharing? Sometimes these are obvious, as with human-subjects research. Sometimes they're less obvious but very, very salient.
  • Are existing players selling data? Which data? Self-explanatory, I hope.
  • How standardized are research practices in this field? How digital? Less important to data sharing per se than to the type, quality, and reusability of the data one is likely to encounter.

None of these factors is determinative, and all have counterexamples—even high-dollar industry doesn't always muzzle data, as genomics bears witness.

Here's the thing. I myself can answer the above questions for only a couple-three fields. If I'm responsible for a campus-wide data service, I need answers for every field the campus does research in!

Moreover, the answers to some of the above questions are ticklish and politically charged, not to mention that your average researcher will expect me to have those answers up-front. I don't want to question a researcher's favorite society's practices in a discussion with that researcher! Likely as not I'll lose the researcher's goodwill on the instant.

So until disciplines are clearly mapped out with respect to the above questions, or appropriate approaches determined for me by well-conducted research, I need a third party with a broad perspective to ask. That, to me, spells "librarian."

One response so far

Welcome Planet Code4Lib readers!

Nov 28 2009 Published under Metablogging

Book of Trogool has just been added to Planet Code4Lib, a library-technology blog aggregator. I am of course honored to be in some very fine company.

I have a mixed readership here: librarians, technology pros, researchers from several disciplines. I encourage all my readers to pop over to take a look at Planet Code4Lib.

If you're not a librarian, chances are that your image of the library and the librarians who staff it is… well, a bit fusty and out-of-date. Planet Code4Lib will open your eyes in a hurry. Do we do the things you think we do? Well, yes, probably. But that's not all we do.

If you're a technology pro working with librarians, figuring out how we think can be a burden. Planet Code4Lib is a marvelous bridge. Along with the hardcore techies, it includes cataloguing and metadata practitioners, vendor representatives, and a few public-service librarians. Follow Planet Code4Lib for a while; you'll learn library jargon and the latest discussions in the field by osmosis.

If you are a librarian, it can be hard to keep up with the technotalk, even enough for basic professional awareness. Chances are you'll find a few blogs on Planet Code4Lib that explain matters clearly, credibly, comprehensively, and comprehensibly. Perhaps you don't need to follow the entire Planet, but it's fantastic for finding the right library-tech blogs.

Welcome, also, to readers coming to Book of Trogool for the first time through Planet Code4Lib. Glad to have you here; stick around and leave some comments!

No responses yet

Peer review, data quality, and usage metrics

Nov 24 2009 Published under Praxis

Another case of things connecting up oddly in my head—

"How do we know whether a dataset is any good?" is a vexed question in this space. Because the academy is accustomed to answering quality questions with peer review, peer review is sometimes adduced as part of the solution for data as well.

As Gideon Burton trenchantly points out, peer review isn't all it's cracked up to be, viewed strictly from the quality-metering point of view. It's known to be biased along various axes, fares poorly on consistency metrics, is game-able and gamed (more by reviewers than reviewees, but even so), and favors the intellectual status quo.

More pragmatically, it is also overwhelmed, and that's just considering the publishing system. I recently had a journal editor apologize for sending me a second article to review within roughly two months. At the risk of being deluged with review requests, I'll admit the apology surprised me: I don't do much article reviewing and was happy to take on another. But that an apology seemed necessary at all is a telling indicator of a severely overburdened system.

We can't add data to the peer-review process. It begins to seem we can't even manage publishing with it. So where does that leave us?

Well, to start with, consider the difficulty of knowing how many have read a print journal article. Library privacy policies aside, counting such things as copies of journals left to be reshelved offers no usage data whatever on the level of the individual article. (This, of course, is one of the fatal flaws of impact-factor measurements as they are currently conducted.) So we have contented ourselves with various proxies: subscriptions as a proxy for individual readership, library reshelvings as a proxy for use, citations as a proxy for influence (which is somewhat more defensible, at least on the individual article level, but not without its own inadequacies), and so on.

Proxies. Heuristics. Because we can't get at the information we really want: how much does this article matter to the progress of knowledge?

Let me advance the notion that for digital data, especially open data, the proof of the pudding may actually be in the eating.

How do we know ab initio that a dataset is accurately collected, useful, and untainted by fraud? Well, we don't. But datasets when used have a habit of exposing their own inadequacies, if any. I know, for example, that Google Maps has a dubious notion of Milwaukee freeways because I once nearly missed my exit when Google Maps erroneously said it was a left exit. Judgment through use and experience.

I believe there is also a curious and potentially useful asymmetry between how publications and data are used. If I disagree with an article in an article I write, I still have to cite the article I disagree with. If I see a bad dataset, I don't have to cite it; I'm far more likely simply to disregard it and use data I do believe in. (This is probably an oversimplification; I can also try to discredit the dataset, or perhaps collect my own, better one. But I suspect the default reaction to faulty data will turn out to be ignoring it.)

Likewise, data may improve through use and feedback, as in many fields they are less fixed than publications. "I would like to mash up my data with yours, but I'm missing one crucial variable," may not be an insuperable difficulty! We can even see this winnowing function in action, as various governments and major newspapers start to release data and respond to critiques and requests. Even in libraries this process is ongoing, as we confront the mismatches between our current data standards and practices and the uses we wish to make of them.

If I am right, data usage metrics and citation standards for data take on new importance. How often a dataset has been directly used may turn out to be a far more useful heuristic for judging its quality than analogous heuristics in publishing have been… and best of all, if we manage citation with any agility whatever (a big if, I grant), use is a passively-gathered heuristic from the point of view of researchers, unlike peer review.

Elegant. I hope this is right. Time will tell, as always.

One response so far

Sustainability: the institutional fiefdom

Nov 24 2009 Published under Praxis

Some interesting ferment happening in repository-land, notably this discussion of various types and scales of repositories and how successful they can expect to be given the structural conditions in which they are embedded.

I don't blog repositories per se any more, so I'm not going to address the paper in detail (though I do think it contains serious oversights). What I'm curious about in the Trogool context is the case of institutionally-hosted services aimed not specifically at the institution, but at a particular discipline.

arXiv. ARTFL. Perseus. Dryad. There's any number of these. One can't call them "institutional" repositories. One can't quite call them "disciplinary" repositories, either, because that implies a source of financial support beyond the institution.

Another class of these resources is not open-access, incidentally; it uses subscriptions to support further additions to the corpus. The Brown Women Writers Project is an example. Much of the "sustainability" talk coming out of think-tanks like Ithaka respects and promotes this business model. I do think it important to note that the institution hardly disappears from the support infrastructure when the subscription dollars start rolling in (assuming, of course, that they do).

I heard yesterday that one such corpus, while of impressive quality and very highly regarded in the discipline, was all but invisible on its home campus, according to the corpus's own staff. Basically, these projects are what I have previously called fiefdoms. (If you don't like that word, you may wish to substitute "research lab." Most of what I'll say applies to them as well.)

Sustainability is where any sort of fiefdom model for data management breaks down. Most fiefdoms get the ball rolling with grant money. This may commit the institution to a certain amount of financial or in-kind support (depending on what the grant spells out), or it may not. If it does, that institutional support lasts only as long as the grant does. No one in this cycle—not the researchers in the fiefdom, not the institution, not the grant agency, no one—takes responsibility for the post-grant existence of anything the fiefdom produces.

For some projects, that's fine. Software projects can cast their code upon the open-source waters, or sell it to industry. Projects that are easily print-publishable can be published. Projects that have dollar signs attached to them can go to the tech-transfer office (though I share the general dubiety that tech-transfer offices are a net win for institutions).

For nearly all digital projects, the fiefdom model is a disaster. Fiefdoms live brief lives, die quiet deaths. Many fly under the radar; asking too loudly or too often for institutional support risks the institution looking down its nose and shutting the fiefdom down.

Arguably, institutions should not do this. Institutions, however, can be remarkably myopic about discipline-oriented behaviors—any behavior that doesn't directly and obviously benefit the institution as a whole, really. One of my favorite examples can be found in the Ithaka report on university presses, in which university provosts loudly trumpeted the necessity of (other institutions', presumably?) university presses to their local scholars, but declined to continue supporting those presses locally, as they were perceived as frills, not strictly necessary to institutional continuance.

As usual, I adduce the library as the institutional component whose mission and funding are best-placed to address this gaping hole in the data-management framework. What academic libraries appear to lack, unfortunately, is the will to step forward and accept this responsibility.

I have no answer, then.

This Thursday is Thanksgiving in the States, and I am furloughed on Friday, so I'll be visiting friends. I'll try to queue up a tidbits post for Friday or the weekend.

No responses yet

Tidbits, 20 November 2009

Nov 20 2009 Published under Tidbits

Have some Friday tidbits!

  • An important biology dataset is losing NSF funding and may fold. Nor (as the article explains) is it the only one. It is impossible to overstate the desperate gravity of the data-sustainability question. Academic libraries, if we are not the white knights here—and we certainly have been in the past; witness arXiv—who is?
  • On a similar theme, Yahoo pulls the plug on GeoCities. O ye researchers relying on consumer-grade web services, or new startups, have an exit strategy! Consumer-grade services die when they lose money. Jason Scott may not come charging to your rescue.
  • H1N1 science depends on a public database of flu immunity data. "As the researchers acknowledge in their paper, the work couldn't have taken place if it weren't for extensive data sharing within the community of flu virus researchers." Data sharing makes possible better, faster science.
  • Data and the journal article. First: if you are saving your data as PDF, stop it. Second: as I suggested to Chris on FriendFeed, there's a serious structural issue with expecting journal publishers to cope with appropriate data archiving: by the time a researcher chooses a journal to publish in, all the decisions about data gathering and representation have already been made—and they may well have been made badly. The poor journal publisher can't go back in time and fix bad decisions! In our not-yet-standardized data age, early data interventions have to happen close to the researcher, which to me means they need to happen at the institution where the research happens.
  • The need for clear data licenses. I haven't talked about data licensing here, partly because the current state of intellectual-property law makes me sick at heart, but there's no question that it's an important piece of the data puzzle.
  • Peer-to-peer technology used for the forces of good: BioTorrents. Datasets vary in size; for the large ones, network latency becomes a sharing problem. Torrenting won't precisely solve the problem, but it certainly increases the size range within which datasets are portable.
  • Fascinating data project of the week: National Center for Ecological Analysis and Synthesis. What caught my attention is that as I read the project description, it takes public data sharing for granted. NCEAS researchers are not generating data; they are mining existing data. I'm inordinately curious about the disciplinary culture that makes this a feasible thing: what price scooping?

Whew. I have a lot more, but it's Friday.

No responses yet

... and then what?

Nov 17 2009 Published under Tactics

It can be difficult to convince present-focused researchers to give a long-term perspective, such as that of a librarian or archivist, the time of day. (So to speak.) Here's my favorite way to do it: the "… and then what?" game.

You have digital data. You think it's important. We'll start from there.

  • Your grant runs out… and then what?
  • The graduate student who's been doing all the data-management chores leaves with Ph.D. in hand… and then what?
  • Your favorite grant agency institutes a data-sustainability requirement for all grants… and then what?
  • Your lab's PI retires… and then what?
  • Your instrument manufacturer or favorite software's developer goes out of business… and then what?
  • Your whomped-up next-door data center burns up, falls down, then sinks into the swamp… and then what?

You get the idea. No far-fetched catastrophizing, just all-too-plausible scenarios that researchers really ought to have thought about already but usually haven't. If your service can position itself as the "… and then what," you're on to something.

No responses yet

Tracking my eyes

Nov 16 2009 Published under Metablogging

I got a very nice email the other day thanking me for being a clearinghouse for e-research information. I'm not quite sure I am that, but just in case I've become it without noticing…

What I read in the area and think is worthwhile enough to keep around ends up in a few places, all of which have RSS feeds:

Happy to share these, and also happy to start up a Zotero group if anyone else is interested in contributing items thereto!

(By the way, one rather annoying thing about the Zotero feed—I almost always save copies of the item along with the item record, and Zotero dumps both into the RSS feed, which from the consuming end looks like a lot of unnecessary duplication. I apologize for this, and wish Zotero would fix it.)

3 responses so far

The basic carrot: usage statistics

Nov 16 2009 Published under Tactics

BMC Bioinformatics published this article describing a "data publishing framework" for biodiversity data.

Stripped to its essentials, this article is about carrots for data sharing. Acknowledging that cultural inertia (some of it well-founded) militates against spontaneous data sharing, the authors suggest a way forward.

I'm calling this one out because it has implications for storage-system design. The authors want three things for their public data: persistent identifiers, citation mechanisms, and data usage information.

(For once, I feel good about institutional repositories: they swing two out of three at the minimum, and some manage all three!)

Persistent identifiers seem simple but aren't, necessarily. For example, does a constantly-changing dataset get a persistent identifier? How does that identifier know what it's identifying, in that case? Should a persistent identifier be just a URL? What if the domain name goes away or changes? (This is not an idle concern; the University of Illinois, for example, just changed its primary domain name, and the institutional repository I run is eventually going to lose its separate domain entirely.) What, exactly, gets a persistent identifier? The entire dataset? Files within it? Should a query performable on that dataset also be persistently identifiable? How does that work, exactly? And when does something get its persistent identifier? As soon as it hits the system? Or after it's done and blessed, if it ever is?

Anyway. All of this needs to be hashed out (so to speak). It's not optional, system designers.
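
To make the indirection concrete, here is a minimal Python sketch (hypothetical identifiers, URLs, and function names throughout, not any real identifier system's API) of the core idea behind schemes like Handle and DOI: the identifier is opaque and permanent, and a resolver maps it to whatever the current location happens to be, so domains can come and go without breaking citations.

    # Minimal sketch of persistent-identifier indirection (all names hypothetical).
    # The identifier never changes; only the resolver's mapping does.

    resolver = {}  # identifier -> current location

    def mint(identifier, location):
        """Assign a new, opaque identifier to an object's current location."""
        if identifier in resolver:
            raise ValueError("%s is already minted" % identifier)
        resolver[identifier] = location

    def resolve(identifier):
        """Follow the identifier to wherever the object lives today."""
        return resolver[identifier]

    def relocate(identifier, new_location):
        """Domain changed? Update the mapping; existing citations stay valid."""
        resolver[identifier] = new_location

    mint("hdl:1234/dataset-42", "https://ir.example.edu/datasets/42")
    relocate("hdl:1234/dataset-42", "https://repository.example.org/d/42")
    assert resolve("hdl:1234/dataset-42") == "https://repository.example.org/d/42"

A constantly-changing dataset fits the same pattern if each frozen version is minted its own identifier; the harder question, per the above, is who keeps the resolver running once the grant money is gone.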

Once that's sorted out, citation isn't actually a huge hurdle from where I'm sitting. It's not a technical problem; it's kicking the style manuals into acknowledging data and making citation formats for it.

Usage, now, that's a hurdle. It is also utterly necessary, for cultural reasons. The culture of academia looks kindly on impact measurements, even hopelessly faulty ones. Somehow or other, research impact has to be measured for researchers' careers to advance. Data are no exception.

(In my professional neck of the woods, systems designers ignored the need for usage statistics entirely too long, which has made my life as an IR manager extraordinarily difficult. I make this post in hopes that the same mistake can be avoided in this new arena.)

What counts as a "use" exactly? How does "use" get harmonized over different kinds of access schemes? How does an API "use" compare with an entire download?

I don't know. I encourage systems designers not to get too hung up on such questions. Record all accesses and make the best decisions you can right now about how to present them. Yes, you'll have to rewrite the event-analysis code, probably more than once, so comment it well.

Do not, however, wait until you have all the answers to write an analyzer. If you do that, you're strangling the open-data movement in its crib. BMC Bioinformatics explains why.
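
As a concrete sketch of that advice (hypothetical field names and weights, not any particular repository's schema): log every access raw, and keep all interpretation, such as what a "use" is or how an API hit compares to a full download, in a separate analysis step you fully expect to rewrite.

    import json
    import time

    LOG_PATH = "access-events.log"  # append-only raw event log

    def record_access(identifier, kind, **extra):
        """Append one raw access event; no interpretation happens here."""
        event = {"when": time.time(), "id": identifier, "kind": kind}
        event.update(extra)
        with open(LOG_PATH, "a") as log:
            log.write(json.dumps(event) + "\n")

    def count_uses(identifier, weights):
        """One possible definition of a 'use'; rewrite freely as consensus shifts."""
        total = 0.0
        with open(LOG_PATH) as log:
            for line in log:
                event = json.loads(line)
                if event["id"] == identifier:
                    total += weights.get(event["kind"], 0.0)
        return total

    record_access("hdl:1234/dataset-42", "download")
    record_access("hdl:1234/dataset-42", "api-query", rows=10)
    print(count_uses("hdl:1234/dataset-42", {"download": 1.0, "api-query": 0.1}))

Because the raw events are never thrown away, the analysis function can be rewritten as many times as consensus demands without losing a byte of history.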

2 responses so far

International Digital Curation Conference

Nov 13 2009 Published under Metablogging, Tidbits

By way of amplifying the signal: the 5th International Digital Curation Conference is coming up in London in December. I will be there in spirit only, I fear, but I hope there will be a Twitter hashtag I can follow?

Chris Rusbridge has blogged the program.

(If I seem more scatterbrained than usual, it's because most of my spare time and brainspace is currently devoted to building a course I will be teaching online in the spring for Illinois's GSLIS. It's a "Topics in Collection Development" course, which means I have to view things through a lens I'm almost completely unfamiliar with—I don't do normal collection development, and most of what I know about it is that it scares me to death! I am designing my version to be "how coll-dev is currently changing and may continue to change." Data curation will be included, as will scholarly communication and the serials crisis, institutional repositories, digital collections, digital preservation, and similar things that I actually do know something about. Wish me luck. I will need it.)

I've gotten some good comments to yesterday's poll. Please keep them coming. I know there's more out there!

No responses yet

Poll: Where are the institutional programs?

Nov 12 2009 Published under Miscellanea

This is a pushmi-pullyu post. I need some help with an environmental scan, so I'll get us started and the rest of you smart folks can amplify my knowledge.

I want to understand what's going on where with data curation specifically at the institutional level (no NOAA, no ICPSR, none of that) Stateside. Grant-funded is fine, though I'm doubly curious about programs that have been weaned (or are weaning themselves) off the grant money. Here are the programs I know about offhand:

Tell me what I'm missing, please and thank you.

7 responses so far
