Archive for the 'Research Data' category

Cowboy and centralized research IT

Feb 08 2011 Published by under Research Data

The question of research-IT provisioning came up in my post on data-security horror stories. I saw some confusion from readers about it, and it's worth examining in detail for other reasons, so here goes.

So let's imagine Achaea University for a moment: immense, a diverse research agenda across many disciplines, lots of grants coming in, but some areas (often but hardly exclusively in the humanities) with no grant money incoming at all. How does Achaea U provision researchers with IT tools and services?

Achaea U doubtless has a central IT unit. At a minimum, it handles networking, campuswide administrative IT (payroll, HR, authentication/authorization, likely the course-management system, perhaps calendaring and email if those haven't been outsourced), and a lot of front-line student- and staff-facing IT (computer labs, campus wireless, helpdesk, webspace, basic web-accessible storage, etc). It may or may not have a learning-technology unit.

It almost certainly doesn't have a research-IT-specific unit. Such research computing services as it provides are of two types: repurposed other services (e.g. webspace), or pay-to-play services (e.g. specialized development teams). Big storage, if it exists, is almost certainly pay-to-play; you pay as long as you keep data on central IT's systems, and if you don't pay, central IT blows the data away. Such research-type services also tend to be "enterprisey" in their technical provisioning, which, combined with pay-to-play, means "serious sticker shock" for the average researcher, even the average well-funded researcher.

Services also tend to be lowest-common-denominator. If you have special needs, such as preservation past grant expiration or diamond-hard security? Tough noogies, chum. Central IT offers what central IT offers; you can take it or leave it. You can yell at central IT all you like that they don't know what the hell they're doing (and they may very well not; insular central IT units can and do gin up services that are convenient for them to provide, while not convenient at all to the intended user). Doesn't matter. Central IT offers what Central IT offers. Take it or leave it.

Most researchers leave it, which means no economy of scale, which means these services cost central IT even more than they need to—and since central IT is pay-to-play, well…

So Achaea U has a lot of other systems running research-related IT. For example, Achaea U does a fair bit of what's called "grid computing" (which has other guises too, but let that go for now). That's not run through central IT, because central IT was too big and ponderous and lowest-common-denominator to jump on that need (it's very hard, organizationally, for central IT to greenlight a service that not everybody on campus will use). Engineering or comp sci owns the grid, or it may have spun off into its own (likely pay-to-play, depending on the status of its internal grant funding) research/service enterprise.

And then we have the other end of the scale: a poorly-funded lone-wolf researcher limping along via a Linux server installed on a dusty beige consumer-grade box under his desk. If it breaks, he's humped, because it was set up years ago by a grad student who has since graduated, leaving no documentation behind, and he doesn't entirely know how it works. It hasn't broken. Yet. Is it backed up? Who the heck knows? Has it been hacked? Who the heck knows? Who the heck knows which networks it's even connected to, for that matter? The researcher sure doesn't. But he knows that his server (plus whatever free-to-him web services he tacks on to his processes) is cheaper by a factor of ten (maybe even a hundred) than equivalent computing provision from central IT! This, folks, is what I mean by "cowboy IT." Yee-ha! And there's a lot of it, scattered all over Achaea U! Yippee-ki-yi-yay!

It is, as I said, a continuum. Based on what's said in the Inside Higher Ed article, Dr. Yankaskas was very close to the cowboy-IT end. Somewhere in the middle, Achaea U has a few research-IT units that work on soft money for small or large groups of researchers. These units are more nimble, discipline-savvy, and responsive than Achaea U's central IT, and they're likely just as competent or more so (especially considering how little central IT knows about research-computing needs); the downside is that they're not as richly-funded and their funding is always in danger, so they probably cut some corners. The worst among them are no better than straight-up cowboy IT; part of the problem is that their staff may be selected by researchers who don't know jack about IT (as clearly happened in Dr. Yankaskas's case).

Plenty of Achaea U researchers, it must be said, can't even muster a cowboy-IT setup, when lack of outside funding combines with lack of skill. They are utterly shut out. Neither central IT nor research-computing units want them because they have no grant money to toss in the pot. The library may do what little it can, particularly for humanities scholars, but it's not enough.

So how do researchers get away with cowboy IT? Well, honestly, nobody's ever looked. It's that simple. And nobody looks because nobody much cares—until there's a huge, embarrassing screwup like the Dr. Yankaskas affair. (If this seems to resemble the laissez-faire IT environment that used to exist for social-security numbers in US universities? Quite right. Same causes.) Classic case of externalities: cowboy IT creates risks, sometimes serious risks to the researcher or even the institution, but mitigating the risks isn't perceived as important (and is known to be expensive) until there's a sudden crisis.

I expect the NSF data-management plan process to expose a shocking amount of cowboy IT in US science research, from the Achaea Universities among us to industry all the way down to the lone-wolves. I also expect the NSF will start to indicate gently that cowboy IT is not acceptable practice… and to become rather less gentle about it over time. This means that researchers will have to internalize risks they hadn't previously worried about, or they'll wind up like Dr. Yankaskas.

I don't entirely know what campus research-IT infrastructures will emerge from this. I wouldn't be celebrating if I worked for central IT; I have serious misgivings that central IT in its ongoing ignorance can even do this right. I'd rather see a mesh of the middles, growing collaboration among research-specific IT units to expand their services, service models, and funding sources to campus cowboys and have-nots. That's a tall order, though; funding models aren't clear, and these units think of themselves as independent fiefdoms, rarely valuing collaboration because of its added process overhead. It doesn't help that central IT will often fight to keep such a mesh from emerging, viewing it as a threat.

So we'll see. The bottom-line truth is that Achaea U will have to do better at research-IT provisioning in the next decade, or it'll start losing grant dollars to universities that work out how to do it right. Yippee-ki-yi-yay.

13 responses so far

Data-security horror stories

Feb 04 2011 Published by under Research Data

I'm afraid we're going to see more data-security horror stories like this in the next few years. It's truly horrific for everyone involved.

Rather than point fingers, because there are multiple levels of epic fail in this situation and nobody comes out smelling like roses, I'll try to pull out some more-or-less depersonalized morals-of-the-story:

  • Knowing why confidentiality is important is not the same thing as knowing how to ensure it, particularly in a networked computing environment.
  • Cowboy research-IT installations and their staffers must soon expect a fair bit more scrutiny than they're used to with regard to many important data-management questions, data security hardly least. These risks may well swing the pendulum away from cowboy IT (widely perceived as cheaper) back to more centralized, accountable systems and staff.
  • The buck stops at the PI. This means that the practice of leaving computing to the young ’uns and part-timers is not going to cut it any more.
  • If it's this bad in biomedicine, which is well-funded… I'm scared about everything else. Really. I may never fill out a survey again. (Okay, that's just because I hate surveys and believe that much too much lazy survey research is done, not least in librarianship.)
  • Policy, policy, where is the policy around data issues? It's years behind where it needs to be, that's where. And don't talk to me about IRBs (or NSF grant reviewers, for that matter; this is a serious and I hope temporary weakness in the NSF data-management plan model). IRBs are made of PIs, not the necessary gimlet-eyed informaticists and IT-security pros. If you've ever been on an IRB, be honest: would you have thought to ask about IT staff competencies?
  • Anybody who reduces research data management to "storage and backup" needs repeated applications of cold water and horror stories like the above one until they come to their senses. It's more complicated than hardware, people. Much more.
  • Ditto anybody (hello, librarians! hello, OAIS model!) who thinks that data management starts when the data are final.

Data security is serious business, especially now that reidentification risks have entered the picture. If you do human-subjects research, or work with any other sensitive data in digital form, take security seriously before you get caught flatfooted.

7 responses so far

The One Schema

Jan 31 2011 Published by under How Libraries Work, Research Data

I grumbled on FriendFeed today that I wish folks (IT folks in particular) would understand that there is no single metadata schema that works for every kind of data in every form in every situation. If you're building a data repository intending to store many kinds of data from many disciplines, it had better have a metadata model that accommodates many different vocabularies.
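For the code-minded, here is a minimal sketch of what "a metadata model that accommodates many different vocabularies" could look like. It is an illustration only, not a description of any real repository platform: the `DatasetRecord` class, its method names, and the vocabulary labels are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """A dataset described by several vocabularies at once (hypothetical model)."""

    identifier: str
    # One block of fields per vocabulary, keyed by the vocabulary's name,
    # instead of forcing every dataset into a single shared schema.
    metadata: dict[str, dict[str, str]] = field(default_factory=dict)

    def describe(self, vocabulary: str, **fields: str) -> None:
        """Add or update fields under the named vocabulary."""
        self.metadata.setdefault(vocabulary, {}).update(fields)

    def vocabularies(self) -> list[str]:
        """Return the names of the vocabularies used so far, sorted."""
        return sorted(self.metadata)


# Made-up example: one record carrying a citation block and a discipline block.
record = DatasetRecord("dataset-001")
record.describe("citation", title="Sky survey images", creator="A. Researcher")
record.describe("astronomy", instrument="example telescope")
```

The only design point is that the discipline-specific block sits beside the citation block rather than being squeezed into it, so adding a new discipline means adding a key, not redesigning the schema.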

Bill Hooker promptly stepped up to the plate with the following dictum (slightly edited by yours truly):

Three schemas for the astronomers under the sky;
   Seven for the urban planners in their halls of stone;
      Nine with which biologists comply;
and ONE for the Librarian on hir Dark Throne:
In the Land of Library, where the metadata lies.
   One schema to rule them all,
   One schema to find them;
   One schema to bring them all;
      And in the repository bind them.
In the Land of Library, where the metadata lies.

I just named my Aeron chair the Dark Throne, y'all.

7 responses so far

Can it be? A metadata standard that makes sense?

Jan 19 2011 Published by under Research Data

I am notorious for hating library metadata standards and standard-like objects. Hate MARC. Hate Dublin Core with a great and wonderful hate. Hate OpenURL. Hate EAD. Hate OAI-PMH and OAI-ORE. Bring me a metadata standard, I'll usually find something to hate.

What does it mean that I like the DataCite Metadata Scheme? Am I losing my edge? Going over the edge? What?

Or it could just be that the DCMS is a sensible minimum that solves the problem at hand (identifying and citing digital datasets) without gobs of cruft or gobs of oversimplification. They've also acknowledged the need to revisit and change the scheme over time, and are working on how that will happen (Open Archives Initiative, I am training laser-eyes on you).

DCMS is not perfect; in my opinion, they'll need to go beyond DOIs to handles and ARKs and PURLs. (Yes, I know all DOIs are handles; not all handles are DOIs.) But for a first cut, it's pretty darn good, and it'll stay that way if they can resist the temptation to cruft it up. Good job, standardistas!

Comments are off for this post

Syllabi (and how rapidly they become obsolete)

Jan 18 2011 Published by under Research Data

So I promised I'd throw my syllabus up for folks to look at, and voilà, I have done so.

A few foot-shuffling words about it. This is a library-school syllabus. I am teaching future librarians, archivists, and records managers. I therefore make no apology for the library focus in this syllabus. If approached to work on an informatics course for a science department, I would come up with a very different syllabus indeed. (I'm up for doing that, by the way; just not alone, unless it's a linguistics or digital-humanities course where I have sufficient disciplinary background not to make a total idiot of myself. Don't ask me to teach cheminformatics all on my lonesome, though; no can do. Find me a cheminformaticist or even a chemist to work with, and I'll see what I can accomplish.)

I haven't cribbed (much) from other curricular materials out there. Possibly I should have; I ran short on minutes. Part of it, though, is that I'm an ornery cuss with a full set of my own ornery notions about what newbie librarian data-managers need to know. That set will change over time! I'm already feeling sorry that I didn't stick in a day on personal digital archiving, and I may yet do so, since I cautiously left a free day in the syllabus.

Part of it is also that curricular materials tend to assume a whole program's worth of courses, rather than just one course. If I paid too much attention to DigCCurr, feelings of utter inadequacy would have prevented me from writing a syllabus at all! There's only so much I can do in a single semester.

The fun bit (for certain values of "fun") of writing syllabi is how rapidly they obsolesce. Teaching and working in a rapidly-growing, rapidly-developing area, as I remarked on Twitter this morning, is an exercise in constant "whoa, hey, look at THAT!" moments. Today is the first official day of class for me (although since this class is all-online and I opened it up late last week, several enterprising students have already dug in, and I even have a couple of first-week homework assignments turned in already!), and what should show up in my feedreader but an entire issue of D-Lib Magazine devoted to research data. Total facepalm moment. If this issue had been out when I was syllabus-writing, half of it would have gone in, I'm sure!

So, you know. I do what I can do. I posted a "whoa, hey, look at THAT!" note to the course-management system. I expect I'll post quite a few more of those, as the semester progresses!

6 responses so far

Syllabus machine

Jan 12 2011 Published by under Metablogging, Research Data

Sorry for the radio silence this week; I thought it might be a good idea to finish my syllabus for this spring's digital-curation course, seeing as how class starts next week and all.

It's pretty much done, finally; I'm working on stuff in the course-management system now. I do intend to post the syllabus online when I'm sure I'm finished. Since this is an all-online course, I'll be doing a fair few audio lectures and screencasts, and I may post a few of those as well over the course of the semester. (Not all of them by any means; the classroom is a sacred space where I can tell horror stories and not get in trouble, but Book of Trogool is not a sacred space.)

This is the first time I've taught this course; it should be a pretty wild ride!

Also, how in the world did anyone do syllabi before there were DOIs? I love DOIs. Find the article, copy-paste the DOI into the syllabus with http://dx.doi.org/ in front of it, done. All the messy access bits get dealt with by library proxy servers and CrossRef infrastructure.
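The copy-paste recipe above is small enough to sketch in Python, for anyone building a reading list by script. The resolver prefix is the one quoted in the post; the function name, the tolerance for a pasted "doi:" label, and the example DOI are assumptions made for illustration.

```python
from urllib.parse import quote

DOI_RESOLVER = "http://dx.doi.org/"  # resolver prefix quoted in the post


def doi_to_url(doi: str) -> str:
    """Turn a bare DOI into a link by putting the resolver prefix in front."""
    cleaned = doi.strip()
    # Assumption: input may carry a pasted "doi:" label, so drop it if present.
    if cleaned.lower().startswith("doi:"):
        cleaned = cleaned[4:].strip()
    # DOI prefixes begin with "10.", so anything else is probably a paste error.
    if not cleaned.startswith("10."):
        raise ValueError(f"does not look like a DOI: {doi!r}")
    # Percent-encode characters that are unsafe in a URL path, keeping "/".
    return DOI_RESOLVER + quote(cleaned, safe="/")


# Made-up DOI, for illustration only.
link = doi_to_url("doi:10.1234/example.5678")
```

The proxy-server and CrossRef side of the story happens after the link is clicked, so none of it needs to appear in the syllabus itself.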

4 responses so far

How to make a digital preservationist cry

Jan 04 2011 Published by under Research Data

Put your thesis on a 5 1/4" floppy disk. Put the floppy in a floppy plastic pocket. Masking-tape the plastic pocket onto the inside of a hanging-file folder (containing the paper copy).

Leave the folder with the floppy pocket with the floppy disk in a file cabinet.

Do all this in 1985. Do not look at the folder again until 2011.

Somebody pass me a tissue. My eyes are watering here.

12 responses so far

Looking toward 2011

Dec 30 2010 Published by under Open Access, Research Data

Before I get to crystal-ball-gazing, I have to point out my track record, because it's really quite bad. Not only am I on record with a major prediction that didn't come true ("IRs in the US will fold"), I quite failed to predict a number of things that did, from Harvard's OA policy to California telling Nature Publishing Group to go suck eggs.

My brain looks at systems. That means I consistently miss outliers, game-changers. I also don't always calibrate my guesses on the durability of systems right.

So with that said, here are some things that wouldn't surprise me a bit in 2011.

  • SCOAP3 ekes through; COPE backpedals or folds. What the open-access movement is facing in 2011 is a world where most of the low-hanging fruit has been plucked. Progress isn't easy or obvious any more (if it ever was), and it can't be made by the pioneers, entrepreneurs, and other earliest-of-early adopters. IRs are no longer fashionable (in the States, I add for my international readers). Gold-OA funds have to contend with the ever-widening maw of Big Deal renewals. My sense of attitudes among research-library administrators, as well as rank-and-file selectors, does not favor COPE's success or even survival.
  • Academic samizdat sees a real copyright lawsuit. Those creeps over at Attributor may well be the instigators. If they're smart, they won't actually sue a university, much less a library; they'll go after Mendeley or something RapidShare-ish, to keep the slumbering faculty behemoth safely abed. It's not out of the question, however, that some tiny school somewhere with grossly inadequate or nonexistent "electronic reserves" protections (and I've seen such schools firsthand; the culprit, aside from faculty themselves, is generally a boundlessly clueless IT shop) will be the target.
  • The initial campus NSF flurry will sputter. I'm worried about this myself. I encourage libraries and IT shops building data-management services on the strength of the NSF's plan requirement to diversify, and that quickly. Find non-NSF people to help. Do a survey or focus-group study to demonstrate non-NSF-related data-management needs. Pay some attention to the digital humanities. Do not plan to rely on a flood of NSF applicants; that flood is highly unlikely to materialize. There's plenty of work to do, don't get me wrong; most of the work just doesn't happen to be NSF work.
  • FRPAA won't make it this time either. Sorry. Maybe next time. Or maybe the NSF won't wait for Congressional cover, though I emphasize the "maybe" on that one.
  • Some chemistry department somewhere will drop ACS accreditation because the institution can't afford ACS journals. I have to admit, I have a little inside info on this one. But it's only logical, really.
  • A bare handful of Big Deal renewals will blow up, à la California and NPG. This is likely to happen in the full glare of the public eye, despite publisher wishes and publisher NDAs, because Big Deals are just that big and that noticeable. Don't be gleeful about this, libraries, because…
  • Faculty will start a lot of "why don't those damn librarians…" grumbling. If you'd like to hear some, pre-2011, have a listen to Amanda French and Tom Scheinfeldt in this episode of the Digital Campus podcast. Those damn librarians. Why don't they just fix this? Where's their damn spine?
  • An IR's gonna fold. Yes, all right, I was wrong when I said this the first time, and I wouldn't be surprised to be wrong again. But I'll say it nonetheless. I see too many libraries who opened IRs on a wing and a prayer without adequate planning or even a sensible collection-development policy. Let's face it, folks: in the absence of mandates, the OA-via-IRs experiment failed. Let's also face that libraries can't run (much less re-run) expensive experiments these days. Result? Some IR somewhere will face a big budget ax. (Disclaimer: those who know me professionally know that the IR I run is getting merged out of existence. That doesn't count for purposes of this prediction; that would be cheating.)
  • We'll see a bare handful more campus or patchwork mandates. I don't think we've quite seen the end of the post-Harvard wave. I do think we're close to that end—and there won't be a second wave, not without a lot more work and evangelism than the open-access movement is currently mustering. There just haven't been enough mandates quickly enough to start up an academic fashion.
  • Another major university press will merge with its library or fold. I haven't a clue which one, but given the continued bumbling confusion among provosts about scholarly publishing being able to cover its nut (hint: it can't), and the continued denial among the humanities that the economics of monographs no longer hold water (hint: go all-digital, perhaps plus POD, or die), this is all but an inevitability. We'll see a few more small scholarly presses fold as well.
  • Crowdsourced data-analysis projects will increase, and pick up more good press. GalaxyZoo alone practically guarantees this one, but the humanities are charging forward with some great transcription projects as well.

It'll be a challenging year, no doubt about it. Let's meet it with fortitude.

8 responses so far

What if we threw a data-curation party and nobody came?

Dec 21 2010 Published by under Praxis, Research Data

So a lot of libraries and campus IT shops in the States are gearing up to deal with this whole NSF data-management plan thing. Websites are going up, would-be consultants are warming up their phones, plans are being planned (and sometimes even executed).

What if we build it and they don't come? Have we thought about this possibility?

I'm afraid my intrinsically Cassandraic nature only partly inspires these questions. We know pretty well from surveys and qualitative investigations (bug me for a bibliography if you like) that the average researcher hasn't a clue librarians can help her look after her research data. The said average researcher despises librarians, for that matter; she thinks that pukka information management can be taught to graduate students soup to nuts in a weeklong seminar, and she thinks that the real limiting skill for data management is deep disciplinary knowledge (which raises the question of why she typically leaves it to wet-behind-the-ears grad students, but…). The average researcher is dead wrong, of course (including about disciplinary knowledge being the sole limiter), but does she know that?

So let's imagine our old friend Dr. Helen Troia of the University of Achaea's Basketology department for a moment, faced with this new NSF requirement. Where will she go for help?

Well, she's probably going to call her NSF program officer first, an eminently reasonable thing to do. I hope the NSF has told its program officers to tell all the Dr. Troias of this world to look for help in their libraries—at least on their own campuses—but I'm not sanguine. What is clear, though, is that the NSF isn't going to manage Dr. Troia's data for her; at most, it'll give her a better idea of what she has to do to prove she's managing it wisely. So where does she go then?

She may also talk to her research-support office. Libraries: does your institution's research-support office know about your NSF-related activities? If it doesn't, better tell it. And she'll have a word with her local grant admin (she's lucky enough to have one) as well. Libraries: what do local grant administrators know about you?

If Dr. Troia's data are digital (not all data covered under the policy are, a point that bears re-emphasis), her next stop is likely to be her departmental IT talent. Libraries: if you are only partnering with campus IT, you may (depending on the way your campus is organized) be missing the boat. Find out where the people in small IT shops hang out, and reach out to them, too.

Now, departmental IT may well take on the job, but they are liable to do it ludicrously wrong. "Here, have some server storage space," they will say, ignoring questions of metadata, versioning, formats, organization, security, citability and other sharing issues, sustainability past grant expiration, and possibly even backup. I'm not sneering; with my own eyes I have seen a campuswide IT shop at a major research university, a shop that should assuredly know better, advertising unbacked-up storage as suitable for data-archiving needs. (No, I won't link. Yes, I am tempted to.) Again, it's a case of people not realizing what they don't know. NSF helper-elves need to be prepared to cope with that.

If departmental IT punts (as it likely should), then and only then will Dr. Troia approach campus IT. She will do so with fear and trepidation, as campus IT tends to be a Cthulhoid monstrosity, as fathomable as sunken R'lyeh and approximately as helpful. Libraries: how is front-line tech support finding out about your NSF-related services?

If none of the above people with whom Dr. Troia interacts points her toward the library, she won't come to the library. I wish that weren't so too. It's so. The inevitable corollary is that outreach efforts should not start with researchers. They should start with the layer of support and administrative staff with whom researchers regularly interact.

Even more cheerfully: none of this may work. We just don't know yet. We'll know much better in a year or so! Best have a plan for if it doesn't. Can you get a list of campus NSF awardees, to contact them individually? Do you have a few campus researchers who are willing to do projects with you? Can you get at the graduate students who are doing the real work?

Good luck. I think we'll all need it.

4 responses so far

Help with NSF data plans

Nov 16 2010 Published by under Research Data

Heather Piwowar is keeping up with NSF news at Research Remix, so while I'm still hors-de-whatever, that's a blog you should be watching.

In the meantime, institutions are starting to marshal responses. The commonest shape appears to be a one-on-one consulting service with associated website. If you're looking for help on your campus, start with a search of your library's website, then try IT, then try the research office.

Here are a few I've run across (or participated in). Feel free to add more in the comments.

2 responses so far
