Archive for: December, 2010

Data and curator: who swallows who?

Dec 07 2010 Published by under Uncategorized

Barend Mons: Data and Curator... who swallows who?

Curation strategy: cross our fingers and hope? Doesn't work! His mantra: "it's criminal to keep generating data without policies to deal with it!" Need to get the collective brain involved.

If you measure anything in -omics, you end up with a Big Ball of Mud. "Ignorance-driven research": measure, get lots of data, get a result if you're lucky. These days it becomes "BIGNORANCE-driven research," the goal of which is to find some kind of signal in all the noise.

Everybody wants structured data, but nobody wants to do structured-data entry! We need to figure out how to get from messy free text to structured data. Note: people WILL do structured data entry if they see how it helps them. (Metadata librarians take note!)

Lots of wikis trying to solve this problem, but what ends up happening is people repeating each other's assertions rather than checking and correcting errors.

Theme emerging: we only need a tiny tiny share of people's online attention to do a LOT of science! Question is how to earn that share.

Knowledge discovery by computers requires computer-operable data. (No big surprise, but it bears repetition.) Dirty data comes from all kinds of datamining: Web, articles, etc. Then clean it up on the wiki and add URIs to evidence, publications, etc. Put the result out as RDF, then use computer reasoning to adduce insights and guess at their reliability. Hoping also to store but not reason over negative results. Soon they'll be able to track "nanopublications" and make sure people get credit.
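The talk didn't get into tooling specifics, but as a rough sketch of what the output could look like: one curated assertion, with URIs for the evidence and the curator so credit can be tracked, might be expressed something like this (Python with rdflib; every namespace, URI, predicate, and score below is invented for illustration, not anything Mons's group actually specified):

    # Sketch only: one curated assertion as RDF with provenance attached.
    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import RDF

    EX = Namespace("http://example.org/vocab/")
    g = Graph()
    claim = URIRef("http://example.org/nanopub/1")

    # The assertion itself, reified so provenance statements can hang off it.
    g.add((claim, RDF.type, RDF.Statement))
    g.add((claim, RDF.subject, EX["gene42"]))
    g.add((claim, RDF.predicate, EX["associatedWith"]))
    g.add((claim, RDF.object, EX["disease17"]))

    # Provenance: evidence URI, curator identity (e.g. an ORCID), and a
    # machine-guessed reliability score that a reasoner could weigh.
    g.add((claim, EX["evidence"], URIRef("https://doi.org/10.1000/example")))
    g.add((claim, EX["curator"], URIRef("https://orcid.org/0000-0000-0000-0000")))
    g.add((claim, EX["confidence"], Literal(0.85)))

    print(g.serialize(format="turtle"))

The point is that every element is a resolvable identifier, so a reasoner can chain assertions together while a person can still trace each one back to its evidence and its curator.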

Partnerships arising to do things that are too complicated for a single researcher or organization to do. Using ORCID, VIVO, etc. to refer to people.

Summary: we need to remove ambiguity and redundancy; we need computer-reasonable data so that we can throw grid computing at it; we need to involve a million minds in curation; we need data publication (not just sharing) so that data become citable; we need data-citation metrics; we need standard setting from the bottom up.

Comments are off for this post

Liveblogging Kevin Ashley's talk at #idcc10

Dec 07 2010 Published by under Uncategorized

(Wireless is terribly wonky at the International Digital Curation Conference, so I'm going to try liveblogging instead of Twitter.)

Kevin Ashley, Director, Digital Curation Centre, "Curation Centres, Curation Services: How many is enough?"

In the US, the answer is three: roughly one center per 100 million people, or one per $120 billion in research funding. D2C2 at Purdue, UC3 at California, DRCC at Johns Hopkins. Does this mean the UK has too many?

How many services are there? Many per center! Who is being served?

Picture is actually considerably more complicated: many centers across the US, doing different things for different disciplines and institutions and people at different points in data lifecycle. E.g.: national libraries, national subject data centers, international subject data centers, university libraries, government data archives, etc.

Each actor has a different idea of where they sit in the DCC data lifecycle. Some focus on access/use/reuse of data, especially those that focus on highly-curated information and want to see a large audience before they take in a dataset. Institutional actors tend to engage earlier on, in the appraisal/selection stage; they won't take everything, but they'll take more and more diverse datasets. Others will pitch in at the very beginning, helping people with ideas make plans for durable data.

Motivations to help out include: "data behind the graph," reuse value of data for research, "data as record" (in the records-management sense), data reuse in education, increase the value of data via data mashups.

Given the complex landscape, it's hard for researchers to figure out where to go to get help, even those who desperately want to do the right thing! They need some kind of decision tree. Kevin suggests accepting that data has different homes at different points in the process; make that easy, and help people point to data wherever it happens to live. Particular problem when publications refer to small slices of a bigger data source; the connection between the slice and the original dataset can get lost.

Various sources of guidance for researchers and service providers; also potential peer reviewers of grants (how do THEY tell a good plan from a bad plan?).

DMP Online: walkthrough tool for researchers; can be adapted to almost any funder policy worldwide. Rule-driven, structured, generic questions. The same tools can aid peer review of grant applications, because everything is reduced to a common template, making plans easier to compare. Again, how many of these services do we need? Is DMP Online enough?
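I don't know DMP Online's internals, but here is a purely hypothetical sketch (question IDs, funder names, and guidance strings all invented) of what "rule-driven, structured, generic questions" reduced to a common template might look like:

    # Hypothetical sketch of a rule-driven plan template; NOT DMP Online's actual data model.
    GENERIC_QUESTIONS = {
        "volume": "How much data will the project produce?",
        "formats": "What file formats and standards will you use?",
        "preservation": "Where will the data live for the long term?",
        "access": "Who may access the data, and under what conditions?",
    }

    FUNDER_RULES = {
        "Hypothetical Funder A": {
            "required": ["volume", "formats", "preservation"],
            "guidance": {"preservation": "Deposit in the funder's designated data centre."},
        },
        "Hypothetical Funder B": {
            "required": ["formats", "access"],
            "guidance": {},
        },
    }

    def build_plan_template(funder):
        """Select and annotate the generic questions a given funder's rules require."""
        rules = FUNDER_RULES[funder]
        questions = []
        for q in rules["required"]:
            note = rules["guidance"].get(q)
            questions.append(GENERIC_QUESTIONS[q] + (" [" + note + "]" if note else ""))
        return questions

    for line in build_plan_template("Hypothetical Funder A"):
        print("-", line)

Because every plan collapses onto the same question set, a peer reviewer can compare two applications question by question instead of wading through two differently structured documents.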

"In preparing for battle, I have always found that plans are useless, but planning is indispensable." Good thing to recognize! Plans always change, but the planning process is still useful.

IDCC presentation two years ago on what university libraries can do:

  • raising awareness of data issues (improving service to research, not just teaching)
  • leading policy on data management at institutional and national level
  • advising researchers on data management
  • working with IT to develop data-management capacity
  • teaching data literacy to research students
  • developing staff skills locally, reskilling/retraining
  • working with LIS educators to identify and deliver appropriate skills in new graduates

Some of these initiatives have been more successful than others! IT/Library interface often troubled or nonexistent. We're not teaching graduate students, or our own library staffs. Working with LIS instructors is inconsistent. But we're doing pretty well on policy and consciousness raising. (Kevin is talking about his own UK context; I think the US would tick different checkboxen.)

Question: should disciplines or institutions take on the data-curation problem? Pros and cons in both directions (I won't copy the slide; it's complex). Disciplines tend to run on short-term funding and have a narrow view of usefulness. Institutions don't tend to have the depth of knowledge.

Institutions need to know what's theirs, know where it is, know what the rules are (who can see it, who assesses it, who discards it, when that changes). It's part of the institution's public portfolio! Marketing it is also an institutional responsibility.

Current decision points for UK Research Data Services: should every institution do this? what are the rules for new subject repositories, as the data landscape changes? what should be done nationally, what locally? drive or follow the international agenda? Where do institutional research administrators fit into all this; can they put aside turf battles for the efficiency of collective action?

What is the impact of research-data management and public data access? Increased citation rates (see Piwowar et al.; hi, Heather!): the 45% of publications in the sample with associated data scored 85% of the citations. (Back-of-the-envelope, that works out to data-backed papers averaging roughly seven times the citations per paper of the rest.) Correlation is not causation, but the link is pretty suggestive. Shared social-science data achieves greater impact and effectiveness: more primary publication, more secondary publication, findings robust to confounding factors. Formal sharing is better than informal sharing, which is better than no sharing at all. These numbers are persuasive to evidence-based researchers; we need to bring this evidence to their attention! Also need more investigation across disciplines.

Definite demographic differences in who will share data. Women more likely than men; northern US likelier than southern; senior researchers more likely than juniors. (But the seniors are TELLING their juniors not to! Selfish and counterproductive, IMO.)

Another impact: reuse. "Letting other people discover the stories our data has to tell." Teaching journalists to mine data: "fourth paradigm for the fifth estate." Push to make government data more open allows savvy journalists to find stories in released data. They're being taught Python, analysis techniques, etc. Sometimes they'll get it wrong; we'll have to live with that.


  • Data is often living; treat it that way! (This is a serious weakness in the OAIS model IMO.)
  • More data in the world than is dreamt of in scholarly research, Horatio!
  • Hidden data is wasted data.
  • International collaboration is essential.
  • We have a duty to examine and promote the benefits of good data management and data sharing.
  • Three centers in the US is not enough!

From Christine Borgman: May not be time yet for rigid policies or too much structure from e.g. NSF. This is an experiment in what the scientific communities themselves will come up with, and what the response will be. Let's hang back and study the results. (I agree wholeheartedly!) Response: No, we don't want rigid rules, but we can help them work toward best practices and structured thinking about what a data-management plan is. And some agencies can legitimately set constraints (e.g. use our data centers) and monitor compliance. Fundamentally, though, right now is about getting people used to the whole idea of data management.

Q from Department of Energy: government is afraid of cost of data curation, uncertain of benefits. The more evidence of impact, the better! A: Absolutely agree! We need to measure and market benefits.

2 responses so far

The Fourth Jeremiad

Dec 07 2010 Published by under Uncategorized

The Christmas season seems to be bringing up a lot of talk about e-books, journal costs (namely increases), and the role of the library in the digital age. Is it because the Kindle and Nook are popular gift wish-list items? Is it because some library vendors are pushing bills a little later, toward the end of the year? I don't know.

Robert Darnton is the Carl H. Pforzheimer University Professor and Director of the Harvard University Library. In his recent article in the New York Review of Books, The Library: Three Jeremiads, Darnton explains many of the things we at BOT have been mentioning and discussing for some time. (Note: Darnton is an historian, not a scientist. Expect verbiage.) I confess I had to look up what jeremiad meant, as I've studied very little theology. Darnton's choice of words is interesting. The OED defines a jeremiad as "A lamentation; a writing or speech in a strain of grief or distress; a doleful complaint; a complaining tirade; a lugubrious effusion."

Once you wade past the Harvard promotional information, Darnton does a thorough job explaining the three main reasons why libraries are in such a bad place right now. Just like in the most recent BOT post (which began as a comment), one of the main themes is control. Who acquires, maintains, and relinquishes control of information is important, as libraries routinely give up control of many aspects of what they do for the greater good of society and culture.

What is Darnton's solution? Creating another library resource, the Digital Public Library of America (DPLA). Per his description, it is "a digital library composed of virtually all the books in our greatest research libraries available free of charge to the entire citizenry, in fact, to everyone in the world." I see this as a library-centric way to regain control of information and to wrest the digital library of the future away from the monopoly of Google et al.

What will the DPLA hold? Of particular note is Darnton's suggestion that "the DPLA would exclude books currently being marketed, but it would include millions of books that are out of print yet covered by copyright, especially those published between 1923 and 1964, a period when copyright coverage is most obscure, owing to the proliferation of “orphans”—books whose copyright holders have not been located." Recent attempts to create updated policies for use of orphan works in the US have been unsuccessful. Is this a way to secure mechanisms to use them without contacting copyright holders? This is very interesting, since Google Book Search (GBS) includes orphan works in its scanning program and has been taken to task for its own interpretation of copyright.

This is interesting, because one by-product of the Google Books Project, HathiTrust, is collecting GBS content as a library-side answer to Google's monopoly on the content. HathiTrust is also starting to build a governance structure, which has more libraries joining to secure a seat at the table. So GBS content will have another outlet controlled by libraries, at least for the foreseeable future. Presumably HathiTrust will hold most of the DPLA content, since the largest academic libraries have already signed onto the GBS project and their works are being scanned as I write this sentence. This also raises another interesting question: what if these GBS books are not the most important books in our culture? What if big parts of these large academic collections are filler, or less relevant for research?

Likewise, Portico and LOCKSS have created a framework to preserve most of the commercially owned journal content. While Portico isn't owned and controlled by libraries (it's owned by Ithaka, a non-profit), LOCKSS is run by the Stanford University Libraries. The only problem, of course, is that apart from individual libraries' efforts to make information available, most of this content is still controlled to some extent through toll access and/or trigger events from the archives. Open-access journal articles and subject repositories already make some content freely available to the public.

The last remaining segments of research are local and special collections individually housed by academic libraries, archives, historical societies, and museums. Many of them are digitizing projects and collections as they are able, and most are putting this information in repositories. Would this information be part of the DPLA? Maybe, although it's not clear from Darnton's article if this is the intention. For many scholars, this primary material is the most essential for their research, not synthesized monographs and summaries. Increasingly, what scientists want is access to research data and supplemental material, not necessarily the final published article or book on a subject.

I think Darnton has missed the point with the DPLA, and it looks to me like duplicative work being promoted to an audience unaware of the environment surrounding digital content and access to information. So I offer a jeremiad of my own: the library community needs to think more broadly and create broader pathways to content, rather than trying to create more specialized channels to information.

The concept of a national library seems outdated to me in light of today's digital environment. I frequently meet and communicate with researchers from all over the world using social networking tools and applications. Digital information doesn't have national boundaries, so why create them in a library? It seems more time should be spent looking at how to create an international digital library or repository, or how to link existing data and research sources, rather than creating segmented units of information for specialized audiences. There is a rapidly growing collection of digital data, research material, and communications, all of which will be of tremendous importance to the next generation of researchers. Who will preserve this? How will it be preserved? This is what a DPLA should be thinking of, not items from 1923-1964 that will likely be saved through other scanning programs or as a print copy.

4 responses so far

Promoted comment: Lost opportunities

Dec 03 2010 Published by under Tactics

(This comment by George Duimovich to Beth's post on ACS pricing changes was so good that I wanted to see it get more play. Our thanks to George for permission to promote the comment to a post!)

Related note: I’ve often wondered whether this space (“subscription agencies”) was ready for a reverse-takeover of some kind. The service is useful, namely consolidated purchasing, and apparently the margins are supposed to be small, so it looks to many like the model should remain intact and it’s a healthy ‘win-win’ relationship.

But I see this as yet another bungled library opportunity. We created a market (“subscription agencies”) but our passive approach let the market run against our better interests and healthier engagement. Here are some summary points I would argue:

  • We didn't clearly scope and demand our interests in metadata management, leaving these subscription agencies with valuable metadata that we pay them to ‘manage’ so that they can in turn sell it back to us via A-Z, link-resolver, and related add-ons. This was IMHO a big missed opportunity for us to more directly control the related services ourselves. So instead, we are becoming bit-players (aka “consumers”) in the goldrush to turn metadata into dependent services that we have to pay for. Our role in this game should be more that of “investors” than “consumers,” helping us reduce costs and create more opportunity for our clients.
  • This situation is reinforced by our bungling of the ILS space, namely allowing the market to move toward extreme vendor lock-in and overly segmented product offerings (where functionality has been doled out in a dysfunctional ‘pay per use’ model rather than more organically). For example, how many “serials modules” are dysfunctional with respect to ERM because of all the add-ons that have to be purchased to make it all work? Yet we still pay for support & maintenance on this “serials module” and can't afford the add-ons. We allowed our ILS vendors to position managing and acquiring electronic resources as “something extra special” (and thus the plethora of add-ons to manage core business functions, etc.). We can see this at play in how many of us actually manage A-Z outside of our ILS. Ditto for the “acquisitions module” too.
  • We've missed opportunities to be better organized on consortial purchasing, pricing activism, and stronger leadership towards open access. Again, these have factored into how the marketplace has left us in a weakened position, imho.

E-contracts and licensing figure prominently in our budgets. For many, this spending even matches or surpasses the entire HR salary budget. We need to be more creative and to engage aggressively in testing alternative configurations in this marketplace.

No responses yet

Friday foolery: Clean the fan!

Dec 03 2010 Published by under Miscellanea

It's Friday! And technology is frustrating; that's not news.

But sometimes we really do make it harder than it should be. Here's a well-produced and hilarious example:

I'll be at the International Digital Curation Conference next week. I'll certainly be tweeting (hashtag is #idcc10) and may blog.

3 responses so far

Zero-sum journal publishing game

Dec 02 2010 Published by under Open Access

Not to toot my own horn or anything—okay, okay, I admit, I like it as much as anyone when my sad excuse for a crystal ball works—but the rumblings about the economic underpinnings of toll-access journal publishing coming unmoored are getting louder.

I said:

Toll-access journal publishing will become a zero-sum game, if it isn’t already. Every dollar of additional profit for the Elseviers and Informas of this world will be ripped from the pockets of other journals and journal publishers, including scholarly societies that haven’t already signed deals with one devil or another.

And so what do I see in my feedreader today? (JUST TODAY. And I read my feedreader several times daily, so this is not buildup.)

Hate to be a Chicken Little, but that sky is looking mighty precariously balanced just now.

So what does this mean for you, O Scientist? You, O Humanist?

Well, if your libraries have successfully insulated you from serials shock heretofore, expect that happy situation to end, abruptly and horribly. (How do you know if you are well-insulated by your library? If "I have all the access I need" or "I know everybody who matters can read what I publish" have ever passed your lips, you are well-insulated. Also poorly-informed, but that is a common symptom of well-insulatedness.) Chances are good you're going to lose access to some core literature in the next year or three, and it could be a lot if a Big Deal suddenly evaporates. Interlibrary loan will not help you. Academic samizdat is chancy (and I wouldn't be surprised to see more attempted crackdowns on it, even lawsuits). Nobody's going to throw more money at libraries. There is no more money.

We'll also see more protests of the California-versus-Nature-Publishing-Group ilk, and maybe some more transparency about library budgets. Not nearly enough more; libraries are both naturally timorous and politically embroiled. But more. Consider participating in the protests, researchers. Publishers listen to you in a way they don't listen to us.

I also said:

No one seems to agree with me on this, but I grow more confident by the day: small, low-subscriber-base journals at Big Deal publishers are in deep trouble as well. They add overhead but no especial additional profit, so they are obvious cost-cutting targets. Perhaps a journal massacre won’t happen right away; EBSCO particularly still seems to be on an acquisitions spree. I do believe it will happen, though—and when it does, some of those journals will re-form as gold-OA, while most of the rest will simply fold, publisher-hopping not being an option.

And I still believe this, even if no one else does. Your favorite publishing outlet may not be long for this world. Better look for some backups. I don't actually consider this necessarily a bad thing. I'm not fond of the idea that any article can get published somewhere, because frankly, a lot that is published shouldn't be (this is based on my own professional reading, but I hear it's the same in other fields). I also think there's way too much overhead involved in duplicative journals and repetitive article-submission patterns. We might come out of this mess with a more rational and streamlined system; I sure wouldn't complain.

Humanists, I don't actually expect much more plundering of monograph budgets. Mind you, if it would help, it'd probably happen, but the amounts left for monographs are a drop in the bucket compared to what serials publishers want to bleed us for. I do expect more university-press closures and more presses becoming part of libraries.

Librarians, I expect that we will be under the gun. We know from Ithaka that our most-prized service is access facilitation, meaning that faculty see us primarily as wallets. When we can't do that any more—yes, even though that's fundamentally not our fault—we can expect many more faculty to wonder why the hell we exist. We'd better damn well have an answer. Or three. Or ten. More answers is good. What we must not do is rest on our collection laurels. Those laurels are about to be stomped flat and dumped in the gutter. Get ready.

This year feels to me the way 2005 felt in the housing market. Big stupid money was still rampant, but the foundation-cracks were evident to some wise souls. Lots of happytalk and problem-denial all over the place. And then everything went off the cliff. I don't know if this will turn out to be an apt comparison, nor am I entirely sure what the cliff-drop will look like. I do think it's coming, though. I do think that.

2 responses so far

Library contracts and journals 101

Dec 01 2010 Published by under How Libraries Work, Tactics

Libraries sign a lot of contracts to get access to content. A LOT. Think of your household and multiply it by a thousand or more. The bigger the library, the more contracts it signs.

Because we do this with so many publishers, organizations, societies, etc., there are other companies set up to manage all these subscriptions, standing orders of book series, and the like. We call them "subscription agents." These agents are so important that they usually give the biggest parties at the largest library conferences. And we all know that's the true barometer of clout in a profession.

The American Chemical Society, or ACS (there they are again), recently sent out information on next year's journal subscription costs to libraries. Now, you may have guessed from earlier posts of mine that the ACS can be a little conservative with regard to publishing. Well, they are conservative with respect to ownership of content, copyright, and open access, but with respect to licenses and pricing they seem to be quite different.

One good example of this is our library agreement with ACS journals. We pay annual costs for both the new content (called the ACS Web Editions) and the archive (the Legacy Archives). This two-tier pricing scheme has been in effect for some time. There are other societies that have journal access set up differently (one price for all content, or no archive available), but most commercial publishers have a similar two-tier system in place. One example of this is JSTOR in the humanities and social sciences.

Another option publishers have is to bundle journals together into packages. Since ACS journals are in high demand, most larger academic libraries have an all-titles (or All Publications) package. This is convenient because when a new title is released you don't have to start a new individual subscription.

All well and good: these systems have been in place for some time and usually eliminate unnecessary paperwork in renewing subscriptions. Long story short, our state-wide agreement with the ACS ended, and we had to negotiate with them to renew our subscriptions. In our case, the price increase was manageable (maybe 5%). Some of the possible reasons for the price increase, as explained to me by our ACS rep, are as follows:

1) Many places received an early-payment discount in the past, and that discount was not factored into the base price for next year, so the base price was effectively raised. While this is odd, it wouldn't surprise me if ACS was in fact doing this and/or it wasn't clear on the invoice. I would recommend libraries check their previous invoices to see whether this accounts for part of their increase.
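Purely hypothetical numbers, just to show how a vanished discount alone can make an invoice jump by far more than the stated increase:

    # Hypothetical figures; not actual ACS pricing.
    old_base = 10000.00
    early_payment_discount = 0.05   # discount applied to last year's invoice
    nominal_increase = 0.05         # this year's stated increase on the base price

    paid_last_year = old_base * (1 - early_payment_discount)   # 9,500
    new_invoice = old_base * (1 + nominal_increase)             # 10,500, no discount this time

    jump = (new_invoice - paid_last_year) / paid_last_year
    print(f"Increase as it appears on the invoice: {jump:.1%}")  # roughly 10.5%

A library that only ever saw the discounted figure would read that as a double-digit increase, even though the list-price increase is 5%.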

2) ACS added two new titles to the Web Editions package, so this subscription was raised accordingly. This seems fair to me and in line with previous increases.

3) The ACS Legacy Archives costs increased for many schools, and in some cases the cost doubled. In our case this didn't happen, but if a school was in a consortial arrangement, this may be why the bill is so much higher.

So there you have it. I don't think this is an evil plan from the ACS but rather an opportunistic way to redefine the parameters of the contract without breaking them. When you sign a contract, as long as the terms are obeyed there's little recourse.

I have seen so many confusing pricing deals from the ACS that after this renewal was settled I moved on. I didn't realize until later that some libraries are seeing very large increases, 20% or more. There's more discussion of this on friendfeed.

I predict we will see more of this pricing instability, especially as newer publishing models develop and mature. Unfortunately this is an area, as we've seen with the Nature situation earlier this year, where it's not possible to share pricing information openly. So speak up, tell your stories, and make some noise when there's a big price increase. Tell faculty why journals are being cancelled; in most cases it's not because of content but for other reasons. I predict this increase will cause some libraries to cancel some ACS subscriptions, because the increases will be too large for them to sustain and the increase is coming late in the year, when it's harder to absorb a larger hit.

12 responses so far

In-tech and Lazinica at it again

Dec 01 2010 Published by under Open Access, Praxis

This is by way of a public-service warning.


Lazinica has the dubious distinction of being the only (as far as I know, anyway) publisher to be told by OASPA to take their logo off his site. Looking through the current In-tech offerings, one is bombarded with nonexistent copyediting and appalling typesetting. I can only guess acquisitions and review standards are equally low or lower, especially the way the outfit goes around trawling for authors.

This is not an outfit that will do your academic career any good. Stay away. Can I interest you in a nice PLoS or BMC instead?

Last I checked, In-tech's journals were still listed in the DOAJ. If I were DOAJ, I'd rectify that problem, but I'm not. And other than OASPA telling Lazinica he can't use their logo, they've been silent on the subject.

So I do what I can to spread the word. Somebody should.

4 responses so far
