Oh, look, that's there

(by Dorothea) Dec 27 2010

I don't know why it didn't occur to me until now to look for an article of mine that was to be published in November, but it didn't. (Perhaps because I've had articles delayed before?)

Anyway, "Who Owns Our Work?" has been out for a month in Serials. If you are paywall-stymied, I self-archived a postprint for you.

The presentation it's based on (which I think is a little easier to follow than the article, honestly, and it's also a bit saltier, which is more fun in my book):

Enjoy. If I find time later this week, I'll drag out my broken crystal ball, because that's always fun.

Tidbits, 2010 end-of-year cleanout

(by Dorothea) Dec 23 2010

Wow, have I ever let the tidbits folder get out of control. Bad me!

I've moved from del.icio.us to Pinboard for the nonce. With the del.icio.us diaspora in full swing, the best way to get me a link of importance is probably to comment here. Speaking of which: I'm getting reports of would-be commenters turned away with 403s. If that's you, would you please drop me a line at dorothea.salo at gmail? I'm trying to get a handle on how widespread the problem is and what might be causing it. Also, I'm really sorry it happened!

What if we threw a data-curation party and nobody came?

(by Dorothea) Dec 21 2010

So a lot of libraries and campus IT shops in the States are gearing up to deal with this whole NSF data-management plan thing. Websites are going up, would-be consultants are warming up their phones, plans are being planned (and sometimes even executed).

What if we build it and they don't come? Have we thought about this possibility?

I'm afraid my intrinsically Cassandraic nature only partly inspires these questions. We know pretty well from surveys and qualitative investigations (bug me for a bibliography if you like) that the average researcher hasn't a clue librarians can help her look after her research data. The said average researcher despises librarians, for that matter; she thinks that pukka information management can be taught to graduate students soup to nuts in a weeklong seminar, and she thinks that the real limiting skill for data management is deep disciplinary knowledge (which raises the question of why she typically leaves it to wet-behind-the-ears grad students, but…). The average researcher is dead wrong, of course (including about disciplinary knowledge being the sole limiter), but does she know that?

So let's imagine our old friend Dr. Helen Troia of the University of Achaea's Basketology department for a moment, faced with this new NSF requirement. Where will she go for help?

Well, she's probably going to call her NSF program officer first, an eminently reasonable thing to do. I hope the NSF has told its program officers to tell all the Dr. Troias of this world to look for help in their libraries—at least on their own campuses—but I'm not sanguine. What is clear, though, is that the NSF isn't going to manage Dr. Troia's data for her; at most, it'll give her a better idea of what she has to do to prove she's managing it wisely. So where does she go then?

She may also talk to her research-support office. Libraries: does your institution's research-support office know about your NSF-related activities? If it doesn't, better tell it. And she'll have a word with her local grant admin (she's lucky enough to have one) as well. Libraries: what do local grant administrators know about you?

If Dr. Troia's data are digital (not all data covered under the policy are, a point that bears re-emphasis), her next stop is likely to be her departmental IT talent. Libraries: if you are only partnering with campus IT, you may (depending on the way your campus is organized) be missing the boat. Find out where the people in small IT shops hang out, and reach out to them, too.

Now, departmental IT may well take on the job, but they are liable to do it ludicrously wrong. "Here, have some server storage space," they will say, ignoring questions of metadata, versioning, formats, organization, security, citability and other sharing issues, sustainability past grant expiration, and possibly even backup. I'm not sneering; with my own eyes I have seen a campuswide IT shop at a major research university, a shop that should assuredly know better, advertising unbacked-up storage as suitable for data-archiving needs. (No, I won't link. Yes, I am tempted to.) Again, it's a case of people not realizing what they don't know. NSF helper-elves need to be prepared to cope with that.

If departmental IT punts (as it likely should), then and only then will Dr. Troia approach campus IT. She will do so with fear and trepidation, as campus IT tends to be a Cthulhoid monstrosity, as fathomable as sunken R'lyeh and approximately as helpful. Libraries: how are front-line tech-support staff finding out about your NSF-related services?

If none of the above people with whom Dr. Troia interacts points her toward the library, she won't come to the library. I wish that weren't so too. It's so. The inevitable corollary is that outreach efforts should not start with researchers. They should start with the layer of support and administrative staff with whom researchers regularly interact.

Even more cheerfully: none of this may work. We just don't know yet. We'll know much better in a year or so! Best have a plan for if it doesn't. Can you get a list of campus NSF awardees, to contact them individually? Do you have a few campus researchers who are willing to do projects with you? Can you get at the graduate students who are doing the real work?

Good luck. I think we'll all need it.

In which copyright is annoying

(by Dorothea) Dec 20 2010

With all the ferment over copyright law currently, I don't understand why someone hasn't pointed out that from a recordkeeping perspective, tying copyright law to author lifespan is an incredibly bad idea, amounting to an immense research tax on would-be preservers and reusers of culture.

I was recently asked about reuse of a published photograph by Paul Regnard, a French psychologist. Don't bother with Wikipedia; he doesn't have an article there, nor was he important enough to make the pages of the scientific and medical biographical dictionaries I could lay hands on. It is possible to triangulate via Google and some fossicking about that he died in 1927.

So French copyright terms, as best I can tell, currently mirror ours (life plus seventy years), with one wrinkle: if you died in active service, your copyright term lasts an extra thirty years. (What was French copyright law like in 1927? When the term was extended to its current length, did the extension apply retroactively? Darned if I know. If anyone would like to enlighten me, feel free.) So if Regnard died in active service, his photographs are still copyrighted. If not, not.
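For the arithmetically inclined, here is a minimal sketch of the term calculation, using only the simplified rule as I've stated it above. The function names are made up for illustration, the assumption that protection runs through the end of the calendar year is mine, and none of this accounts for transitional rules or retroactivity, which are exactly the parts I can't answer.

```python
# Simplified term arithmetic based only on the rule as stated in the post.
# Assumption: protection lasts through the end of (death year + term).
# Real French law has wrinkles this sketch ignores.

BASE_TERM_YEARS = 70             # life plus seventy, per the post
ACTIVE_SERVICE_BONUS_YEARS = 30  # the "one wrinkle", per the post


def last_protected_year(death_year: int, died_in_active_service: bool) -> int:
    """Return the final calendar year of protection under the simplified rule."""
    term = BASE_TERM_YEARS
    if died_in_active_service:
        term += ACTIVE_SERVICE_BONUS_YEARS
    return death_year + term


def still_protected(death_year: int, died_in_active_service: bool, as_of_year: int) -> bool:
    """True if the work is still under copyright in as_of_year."""
    return as_of_year <= last_protected_year(death_year, died_in_active_service)


if __name__ == "__main__":
    # Regnard died in 1927; the question was asked in 2010.
    for service in (False, True):
        print(service, last_protected_year(1927, service), still_protected(1927, service, 2010))
```

Under those assumptions the whole question turns on one boolean that nobody has on file: 1997 without active service, 2027 with it.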

I'm not planning to investigate 1920s service records for France. I'm just not. So there the matter rests.

Frankly, as a pragmatic tradeoff I'd accept a longer copyright term (odious though that would be) in exchange for a more precise one, such that I wouldn't have to fuss about French service records. I mean, merde.

Friday foolery: the twelve months of Trogool

(by Dorothea) Dec 10 2010

Right, so, it's December and time for all those year-end recaps. Here's what we've been frothing at the mouth talking about this year, in the form of the first sentence of the first post from each month:

  • January: "Peter Keane has a lengthy and worthwhile piece about the need for a “killer app” in data management." (Wow. I still like this post. That's rare, with me.)
  • February: "Happy Groundhog’s Day Eve!" (Er, okay.)
  • March: "So the United Nations’ Intergovernmental Panel on Climate Change is mired in a rapidly heating controversy over a report that apparently let some dubious information slip through the cracks." (Funny, how this mess has colored everything else that's happened this year in data-management-land.)
  • April: "Not good at organizing your thoughts, much less your research notes?" (Did this actually fool anybody?)
  • May: "Having made it back at last from Scotland despite the ash cloud, and overcome jetlag and (some) to-do list explosion, I finally have leisure to reflect a bit on UKSG 2010." (Boy howdy, do I ever still believe this post. The cracks are showing. And widening.)
  • June: "*blows off the dust*" (Argh. Been a lot of that this year. Sorry.)
  • July: "I would be utterly remiss in my duties were I not to point out SciBling John Wilbanks’s vitally important new open-access initiative." (Heh.)
  • August: "Greetings again, gracious readers." (Oh, yeah. We kinda moved this year. It was cool.)
  • September: "Christine Borgman has a lengthy track record of saying smart and apposite things about scholarly communication and research data." (Yep. Meeting her at IDCC 2010 was a conference highlight for me.)
  • October: "In the first sentence I link to the article, making sure not to use the verboten “click here” as the link text." (I guess a lot of first-posts ended up on Friday this year?)
  • November: "A faculty friend of mine forwarded me the email following." (Where is the Wikileaks for ridiculous journal-publisher behavior?)
  • December: "This is by way of a public-service warning." (Bears repeating.)

Cheers. I do hope 2011 is a better year for blogging. This one has been rough on me—not always for bad reasons, to be sure, but even so.

Themes from IDCC 2010

(by Dorothea) Dec 08 2010

A few themes coalesced in my head while I was attending IDCC 2010. I don't pretend they're the conference themes; in fact, I know they're not. They're just my personal aha moments.

"Set and forget"

This community understands pretty well that preservation is not a "set and forget" process. The communities this community is embedded in tend not to get that. It's a problem.

I had a good conversation with John Mark Ockerbloom about LOCKSS, which is commonly understood as "set and forget" but which is by no means robust enough to go without auditing and active intervention.

Institutional repositories have been actively marketed as "set and forget," and we all know where that ended up. In this case, though, it's not so much the auditing that falls down (IRs are actually pretty good at hanging onto bits and bytes) as policy decisions, active collection work, and hardheaded assessment of progress. More on this in a bit.

In any case, "set and forget" is at best an empty promise, at worst an outright lie, and it's good to remember that.

Data curation community of practice

It's scary to be on the bleeding edge, as research data management clearly is. It's doubly scary for those of us who have been on the bleeding edge and suffered for it. What mitigates the fear is community, and I'm quite pleased that data management is even at this early stage building a more active and cohesive community of practice than institutional repositories have ever managed to do.

Reasons for this include the absence of normative software communities in the data-curation space; the potential IR community fragmented quickly and completely around software choices. The enormity of the job also helps. Everyone thought (wrongly) that IRs could be built and maintained by one person with one hand tied behind her back, so where's the need for community? Everyone now thinks (correctly) that research-data management is much larger than any one person, any one library, or even any one institution. We're all looking for partners, collaborators, agony aunts.

And even better, we're finding them.

Open access is losing libraries and librarians

Library involvement in the open access movement in the United States is in trouble. I don't think the movement has entirely come to grips with that yet, but it is. As the "Cassandra of open access," I'd be remiss if I didn't say something.

I see a fair few symptoms. SCOAP3 is going down to the wire. COPE is floundering. When asked to pony up money for open access, I hear librarians and library administrators saying "Look, I thought OA was supposed to fix this budget crisis; instead, it's making my budget picture worse. In fact, when I go ask for more money for serials, I get asked why OA hasn't fixed the problem yet. Go find some other sucker; I'm done propping up this sad little sham of yours."

If that's not bad enough, OA is quietly, steadily losing its footsoldiers in libraries whose institutions don't boast OA mandates. Consider my illustrious co-blogger Sarah Shreeves. Her sole responsibility used to be running Illinois's institutional repository. These days, I learned at IDCC, she is also running the new Scholarly Commons and co-chairing the campus data-curation initiative. These initiatives eat up so much of her time that the IR has of necessity taken a bit of a back seat. I don't talk about my own job here (I really can't), so I'll just say that she and I have been professional twins for a long time, and we continue to be so.

This is great for those footsoldiers, mind you. Being an OA and/or IR footsoldier in the average US academic library is abject misery. The open access movement has never helped, or even taken notice that there might be a problem; when it's not proclaiming loudly that it doesn't exist to solve library problems, it's openly insulting libraries and librarians over a variety of so-called derelictions. This demoralizes the footsoldiers, as well as damaging their credibility and effectiveness within their institutions and their libraries.

The fair few footsoldiers I know are bright, talented, energetic people. I'm frankly thrilled their libraries are recognizing that and finding better professional situations for them. The OA movement, however, shouldn't be as thrilled as I am.

A little while ago I helped coach a friend into a job running a brand-new IR. I encouraged my friend to grill the employer pretty hard on what they were planning to do with the IR—the two questions I've been advocating for years, "what do you want?" and "how are you going to get it?"—and what I learned is that OA is so far down the list (there is a list, at least) for that library that it might as well not be there at all.

In its way, the very success of Open Access Week is a symptom. Listening behind the scenes and reading between the lines this year, I heard a fair few isolated librarians struggling against their own libraries to put together anything at all for the occasion. Several needed the OA Week banner ("this is an international event! it's embarrassing not to participate!") to goose their libraries into action. In addition, I got a distinct sense that some libraries put on an OA Week event in order to tick off the "did something about OA" tickybox for the year, in essence giving themselves an excuse not to do anything else.

I don't have any bright ideas, I'm afraid. I do believe that ARL/SPARC needs to turn its attention to stiffening its membership's collective spine, and giving them a clear and actionable roadmap to follow.

It's quite possible, even likely, that the OA movement will react to these symptoms with a collective shrug; that's certainly how they've treated libraries heretofore. I'm too personally demoralized by the whole mess to argue. The proof of the pudding, and all that. But if US IRs start folding and COPE doesn't make it and institutional mandates stop happening or existing ones backpedal, don't say I didn't warn you.

The Four Sons of digital curation

(by Dorothea) Dec 07 2010

So I wanted to put in my two penn'orth on this question on DHAnswers about best-practice guidelines for data in the humanities, but what I have to say is a little askew of where that discussion seems to be going. I'll say my piece here, then, and link from there.

At CurateCamp yesterday, the discussion of a curation community of practice suddenly took an extraordinarily technogeeky turn. By way of bringing it back to earth a bit, I pulled out a well-worn analogy that I've used before in other contexts: the Four Sons parable from the Pesach service.

The First Son in the Pesach parable asks his father to describe to him in exhaustive detail all the observances of Pesach and all the stories behind those observances, so that he can do everything correctly and pass on the knowledge to his descendants. Everybody in the CurateCamp room, myself certainly included, was a First Son. We can't get along without our First Sons. The peril of First Sons, though, is that they tend to lack perspective and get caught up in pilpul.

This is exactly what happened at DHAnswers. A couple-three First Sons got to duking it out about the value (or lack thereof) of SGML/XML markup. Derailed the entire conversation into a tiny, tiny corner of a very big question. It's what was happening at that particular moment in CurateCamp, too. It happens a lot, and it's a problem.

The Second Son in the Pesach parable asks, "What is all this to you?" By saying "to you," and not "to us," the Second Son intentionally and hostilely places himself outside the community, treating it as a zoo full of weird and occasionally unsavory animals. He doesn't understand what's going on and will have to be talked into caring. In universities, a lot of Second Sons live at high echelons of library, IT, and university-wide administrations. Grant funders have a fair few of them too.

The Third Son asks only, "What is this?" He's not hostile, but he's utterly clueless, not even understanding what he doesn't know. I've met Third Sons in large numbers among faculty. As the Pesach fable explains, Third Sons need simple and straightforward explanations that they can follow even if they don't really understand the problem domain.

The Fourth Son does not even know how to ask, and he exists in large numbers among faculty as well. The Pesach parable insists upon outreach.

The Third and Fourth Sons are why so very many early digital projects are no longer extant. The Third and Fourth Sons are the ones who perpetrate all the wrongheaded antipatterns DHAnswers has so kindly and snarkily collected. The digital humanities cannot progress among the humanities generally until the Third and Fourth Sons receive more and better guidance—emphatically including warning them away from common antipatterns!

Here's the thing. Too many approaches to digital curation, even to explaining digital curation, are aimed at First Sons. This is self-limiting, counterproductive behavior. Whatever the ACH and the NEH do to address data management in humanities research, it needs to be aimed at all four sons.

Idiosyncrasy at scale: digital curation and the digital humanities

(by Dorothea) Dec 07 2010

John Unsworth, Illinois. "Idiosyncrasy at scale: digital curation and the digital humanities."

Can't remove ambiguity in the humanities (the way you can in chemistry)! We'd remove everything that matters. This can make it hard to talk about humanities "data" (is there a thermometer for the zeitgeist?). Humanities data are idiosyncratic because the people who make them are.

Research methods are changing as traditional objects of humanities study (e.g. diaries, correspondence) become born-digital. Still have to "tame the mess," recognize that mess has value, including as a mess. Is departure from the norm an "error" or a "data point"?

"Retrieval is the precondition for use; normalization is the precondition for retrieval." (Not sure I agree with this! Techniques exist to deal with messiness.)

Six laws to give us pause:

  • Scholars interested in particular texts.
  • Tools are only useful if they can be applied to texts of interest.
  • No one collection has all texts.
  • No two collections are format-identical.

Therefore: humanities data narratives include normalization (of "Frankendata:" broadly aggregated but imperfectly normalized data). Lots of different kinds of normalization (spelling, punctuation, chunking, markup, metadata).

Example: MONK project, using EEBO and ECCO within the CIC. (Me, on soapbox: This. THIS. is the collateral damage from "sustainability" initiatives that impose firewalls around content. If you're not in the CIC, too bad so sad, you can't use these data.) Lots of data-munging which I won't recount.

Example: Hathi Trust, now available through API. Will be central player in developing research uses for digitized texts. Doing preprocessing/normalization blows up storage space necessary by 100x. There will be a research center established for working with this corpus.

Can we crowdsource corrections, a la GalaxyZoo? People are interested and willing, it can't be automated, and we need the help.

How do I keep my solution from becoming your problem? Association for Computers and the Humanities trying to crowdsource some best-practices recommendations for humanities researchers on managing their digital/digitized collections. Immediate conflict on DHAnswers site: to use markup or not to use markup? Practical upshot: when do we have usefully shareable data? When should we stop messing with it so others can use it? What's data and what's data interpretation, and what do we do when they coexist in the same marked-up text?

Humanities data is bigger than books! Books are the tip of the iceberg. NARA strategy for digitizing archival materials: they have 5x the pages of what's in Hathi Trust, in much less tractable forms than the books Google/Hathi is working on. And that's just one archive! We'll have to learn how to manage this kind of scale.

Data and curator: who swallows who?

(by Dorothea) Dec 07 2010

Barend Mons: Data and Curator... who swallows who?

Curation strategy: cross our fingers and hope? Doesn't work! His mantra: "it's criminal to keep generating data without policies to deal with it!" Need to get the collective brain involved.

If you measure anything in -omics, you end up with a Big Ball of Mud. "Ignorance-driven research:" measure, get lots of data, get a result if you're lucky. These days, it becomes "BIGNORANCE driven research," the goal of which is to find some kind of signal in all the noise.

Everybody wants structured data, but nobody wants to do structured-data entry! We need to figure out how to get from messy free text to structured data. Note: people WILL do structured data entry if they see how it helps them. (Metadata librarians take note!)

Lots of wikis trying to solve this problem, but what ends up happening is people repeating each other's assertions rather than checking and correcting errors.

Theme emerging: we only need a tiny tiny share of people's online attention to do a LOT of science! Question is how to earn that share.

Knowledge discovery by computers requires computer-operable data. (No big surprise, but it bears repetition.) Dirty data comes from all kinds of datamining: Web, articles, etc. Then clean it up on the wiki and add URIs to evidence, publications, etc. Put the result out as RDF, then use computer reasoning to adduce insights and guess at their reliability. Hoping also to store but not reason over negative results. Soon they'll be able to track "nanopublications" and make sure people get credit.
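To make "put the result out as RDF" a bit more concrete, here is a rough sketch of what one cleaned-up assertion plus its evidence links might look like as N-Triples. This is my illustration, not the speaker's system: every URI is an example.org placeholder, the vocabulary terms (subject, predicate, object, supportedBy) are invented, and the function names are hypothetical.

```python
# Hypothetical sketch: all URIs and vocabulary terms are made-up placeholders.

VOCAB = "http://example.org/vocab/"


def ntriple(subject: str, predicate: str, obj: str) -> str:
    """Format one statement whose three terms are all URIs as an N-Triples line."""
    return f"<{subject}> <{predicate}> <{obj}> ."


def assertion_with_evidence(assertion_id, subject, predicate, obj, evidence_uris):
    """Return N-Triples lines for one claim plus links to its supporting evidence."""
    lines = [
        ntriple(assertion_id, VOCAB + "subject", subject),
        ntriple(assertion_id, VOCAB + "predicate", predicate),
        ntriple(assertion_id, VOCAB + "object", obj),
    ]
    # One extra triple per piece of evidence, so a reasoner (or a human)
    # can trace the claim back to where it came from.
    for uri in evidence_uris:
        lines.append(ntriple(assertion_id, VOCAB + "supportedBy", uri))
    return lines


if __name__ == "__main__":
    for line in assertion_with_evidence(
        "http://example.org/assertion/1",
        "http://example.org/gene/A",
        "http://example.org/relation/associatedWith",
        "http://example.org/disease/B",
        ["http://example.org/publication/123"],
    ):
        print(line)
```

Giving the assertion its own URI is the design choice that matters here: it is what lets evidence, and eventually credit, attach to the claim itself rather than to a whole paper.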

Partnerships arising to do things that are too complicated for a single researcher or organization to do. Using ORCID, VIVO, etc. to refer to people.

Summary: we need to remove ambiguity and redundancy; we need computer-reasonable data so that we can throw grid computing at it; we need to involve a million minds in curation; we need data publication (not just sharing) so that data become citable; we need data-citation metrics; we need standard setting from the bottom up.

Liveblogging Kevin Ashley's talk at #idcc10

(by Dorothea) Dec 07 2010

(Wireless is terribly wonky at the International Digital Curation Conference, so I'm going to try liveblogging instead of Twitter.)

Kevin Ashley, Director, Digital Curation Centre, "Curation Centres, Curation Services: How many is enough?"

In the US, answer is 3: roughly one center per 100 million people, one per $120 billion in research funding. D2C2 at Purdue, UC3 at California, DRCC at Johns Hopkins. Does this mean the UK has too many?

How many services are there? Many per center! Who is being served?

Picture is actually considerably more complicated: many centers across the US, doing different things for different disciplines and institutions and people at different points in the data lifecycle. E.g.: national libraries, national subject data centers, international subject data centers, university libraries, government data archives, etc.

Each actor has a different idea of where they sit in the DCC data lifecycle. Some focus on access/use/reuse of data, especially those that focus on highly-curated information and want to see a large audience before they take in a dataset. Institutional actors tend to engage earlier on, in the appraisal/selection stage; they won't take everything, but they'll take more and more diverse datasets. Others will pitch in at the very beginning, helping people with ideas make plans for durable data.

Motivations to help out include: "data behind the graph," reuse value of data for research, "data as record" (in the records-management sense), data reuse in education, increase the value of data via data mashups.

Given the complex landscape, it's hard for researchers to figure out where to go to get help, even those who desperately want to do the right thing! They need some kind of decision tree. Kevin suggests accepting that data has different homes at different points in the process; make that easy, and help people point to data wherever it happens to live. Particular problem when publications refer to small slices of a bigger data source; the connection between the slice and the original dataset can get lost.

Various sources of guidance for researchers and service providers; also potential peer reviewers of grants (how do THEY tell a good plan from a bad plan?).

DMP Online: walkthrough tool for researchers; can be adapted to almost any funder policy worldwide. Rule-driven, structured, generic questions. The same tools can aid peer review of grant applications, because everything is reduced to a common template, making plans easier to compare. Again, how many of these services do we need? Is DMP Online enough?

"In preparing for battle, I have always found that plans are useless, but planning is indispensable." (Eisenhower.) Good thing to recognize! Plans always change, but the planning process is still useful.

IDCC presentation two years ago on what university libraries can do:

  • raise awareness of data issues (improving service to research, not just teaching)
  • leading policy on data management at institutional and national level
  • advising researchers on data management
  • working with IT to develop data-management capacity
  • teaching data literacy to research students
  • developing staff skills locally, reskilling/retraining
  • working with LIS educators to identify and deliver appropriate skills in new graduates

Some of these initiatives have been more successful than others! IT/Library interface often troubled or nonexistent. We're not teaching graduate students, or our own library staffs. Working with LIS instructors is inconsistent. But we're doing pretty well on policy and consciousness raising. (Kevin is talking about his own UK context; I think the US would tick different checkboxen.)

Question: should disciplines or institutions take on the data-curation problem? Pros and cons in both directions (I won't copy the slide; it's complex). Disciplines tend to run on short-term funding and have a narrow view of usefulness. Institutions don't tend to have the depth of knowledge.

Institutions need to know what's theirs, know where it is, know what the rules are (who can see it, who assesses it, who discards it, when that changes). It's part of the institution's public portfolio! Marketing it is also an institutional responsibility.

Current decision points for UK Research Data Services: should every institution do this? what are the rules for new subject repositories, as the data landscape changes? what should be done nationally, what locally? drive or follow the international agenda? Where do institutional research administrators fit into all this; can they put aside turf battles for the efficiency of collective action?

What is the impact of research-data management and public data access? IMPACT. Increased citation rates (see Piwowar et al -- hi, Heather!); the 45% of publications in the sample with associated data scored 85% of the citations. Correlation is not causation, but the link is pretty suggestive. Shared social science data achieves greater impact and effectiveness: more primary publication, more secondary publication, findings robust to confounding factors. Formal sharing is better than informal sharing, which is better than no sharing at all. These numbers are persuasive to evidence-based researchers; we need to bring this to their attention! Also need more investigation across disciplines.
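A quick back-of-the-envelope on the 45%/85% figure, taking the numbers exactly as I jotted them down during the talk. This is only the raw per-paper ratio implied by those two shares; it controls for nothing, so it says nothing about causation, and the function name is mine.

```python
def per_paper_citation_ratio(share_of_papers: float, share_of_citations: float) -> float:
    """Average citations per paper for the data-sharing group, divided by the
    average for everyone else, given each group's share of papers and citations."""
    with_data = share_of_citations / share_of_papers
    without_data = (1 - share_of_citations) / (1 - share_of_papers)
    return with_data / without_data


if __name__ == "__main__":
    ratio = per_paper_citation_ratio(0.45, 0.85)
    print(f"{ratio:.1f}x")  # prints "6.9x"
```

So on the raw shares, a paper with associated data in that sample drew roughly seven times the citations of one without. Any effect that survives controlling for confounders will be smaller than that, which is why the numbers above are suggestive rather than conclusive.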

Definite demographic differences in who will share data. Women more likely than men; northern US likelier than southern; senior researchers more likely than juniors. (But the seniors are TELLING their juniors not to! Selfish and counterproductive, IMO.)

Another impact: reuse. "Letting other people discover the stories our data has to tell." Teaching journalists to mine data: "fourth paradigm for the fifth estate." Push to make government data more open allows savvy journalists to find stories in released data. They're being taught Python, analysis techniques, etc. Sometimes they'll get it wrong; we'll have to live with that.


  • Data is often living; treat it that way! (This is a serious weakness in the OAIS model IMO.)
  • More data in the world than is dreamt of in scholarly research, Horatio!
  • Hidden data is wasted data.
  • International collaboration is essential.
  • We have a duty to examine and promote the benefits of good data management and data sharing.
  • Three centers in the US is not enough!

From Christine Borgman: May not be time yet for rigid policies or too much structure from e.g. NSF. This is an experiment in what the scientific communities themselves will come up with, and what the response will be. Let's hang back and study the results. (I agree wholeheartedly!) Response: No, we don't want rigid rules, but we can help them work toward best practices and structured thinking about what a data-management plan is. And some agencies can legitimately set constraints (e.g. use our data centers) and monitor compliance. Fundamentally, though, right now is about getting people used to the whole idea of data management.

Q from Department of Energy: government is afraid of cost of data curation, uncertain of benefits. The more evidence of impact, the better! A: Absolutely agree! We need to measure and market benefits.

