Archive for: December, 2009

New Year tidbits

Dec 31 2009 Published under Tidbits

Wishing all of us a happy, prosperous, data-filled 2010.

See something that should be part of a future tidbits post? Comment here or tag it "trogool" on

Making author authority easier

Dec 29 2009 Published under Praxis

I wrote last week about name authority control for authors. I hinted that systems are coming. I hope that journals, databases, catalogues, and repositories adopt them when they emerge, the sooner the better.

Even when they do, though, there's an immense problem to solve, in the form of the millions (billions? I shouldn't wonder) of articles that will have to be retrofitted into the system. It's work not unlike what I'm doing at the moment, so I can say with authority (sorry, sorry) that it's often not easily accomplished.

Researchers, institutions, others, you can do some things to make the transition easier. What's in it for you? Correct credit for your work, your colleagues' work, your students' work! (Check Scientific Commons or CiteSeer for a sense of the scope of the problem.)

  • Researchers: track your graduate-student researchers, publicly, with their full names. If you've ever published with them, they should be listed, regardless of whether they achieved the degree. (There's a rather better than even chance a name I can't resolve belongs to somebody's grad student, based on names I've had to spend a long time chasing down, and the methods that tend to work.)
  • The same goes for research staff that receive credit on published papers, and people who once worked with you but have moved on. I see some research groups with "alumni" lists on their websites. These are fantastic! They're even more helpful with dates of tenure.
  • Conference organizers: resist the temptation to use initials on published conference papers. Please use full names. Sure, everybody in the field knows good old "Smith, J." I'm not in the field and I don't.
  • Conference organizers: please publish presenter lists and/or conference schedules with full names. It helps a lot!
  • Women: I am acutely sensitive to gender issues in academia; I live them too, though assuredly not as much as many. I know there are strong incentives in various fields to conceal your given name in order to conceal your gender. I don't and wouldn't fault you for doing that. I am saying that the jig is up, however; the coming author-ID systems will make concealment impossible. Please, at least consider building a professional web presence keyed to your full name, so that people like me can ensure that people like you get all due credit for your hard work. Thanks. (There's a pretty good chance a name I can't resolve belongs to a woman, again based on names I've had to work extra hard to track down. Most of my initialisms hail from physics and engineering.)
  • Institutions: consider providing a public search of your alumni database. The one at my institution has been utterly invaluable (those invisible graduate students again!). For my purposes, I need the student's full name along with achieved degree(s) and degree dates, and I need to be able to search on last-name-first-initial.
  • Research assessors of all stripes: please publish your institution's author lists. A way to federate these for searching would be even better, but I'll take what I can get. (In many articles, what I have to work from is last name, first initial, and place of employment.)
  • Cataloguers: let's revisit transcription rules. I can entirely understand that being required to control every name in the catalogue, no matter how minor, is an unacceptable amount of work. However, being forbidden to control some names (e.g. local dissertation advisors) is going to cause problems upstream for people like me. Let's fix this now, shall we?
  • Repository managers: Fixing names is fiddly, time-consuming, difficult, frustrating work. Let's do it anyway. It'll have to be done at some point!
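For the curious, the last-name-first-initial searching mentioned in the list above can be sketched roughly like this. It is a minimal illustration, not anyone's production system: the roster entries, field names, and function names are all invented for the example, and it assumes names arrive in "Surname, Given" form.

```python
from collections import defaultdict

def initial_key(full_name):
    """Reduce 'Smith, Jane Q.' or 'Smith, J.' to the lookup key ('smith', 'j')."""
    surname, _, given = full_name.partition(",")
    return (surname.strip().lower(), given.strip()[:1].lower())

def build_index(roster):
    """Index roster entries (dicts with a 'name' field) by surname and first initial."""
    index = defaultdict(list)
    for person in roster:
        index[initial_key(person["name"])].append(person)
    return index

def candidates(index, cited_name):
    """Return every roster entry that could be the person behind 'Smith, J.'."""
    return index.get(initial_key(cited_name), [])

# Hypothetical alumni roster, invented purely for illustration.
roster = [
    {"name": "Smith, Jane Q.", "degree": "PhD", "year": 2004},
    {"name": "Smith, John", "degree": "MS", "year": 1998},
    {"name": "Nguyen, Linh", "degree": "PhD", "year": 2007},
]
index = build_index(roster)
matches = candidates(index, "Smith, J.")
```

Here "Smith, J." comes back with two candidates, which is exactly why full names, degrees, and degree dates matter: a human still has to pick the right Smith, and the extra fields are what make that possible.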

Interesting times for metadata managers… it would be awfully nice to federate some of this work. Ah, well.

Top-down or bottom-up?

Dec 28 2009 Published under Tactics

As I watch the environment around me for signs of data curation inside institutions, particularly in libraries, I seem to see two general classes of approach to the problem. One starts institution-wide, generally with a grand planning process. Another starts at the level of the individual researcher, lab, department or (at most) school; it may try to scale up from there, or it may remain happy as its own self-contained fief.

As with anything, there are costs and benefits to both approaches.

Some of the challenges of data-driven research carry costs and infrastructure that only make sense on an institutional level at this juncture. Grid computing. Gigantic, well-managed disk. (Gigantic disk is fairly cheap. Gigantic well-managed disk will cost you. In my mental model of the universe, I include such things as periodic data audits and geographically-dispersed backups in the cost of disk.) Authorization and authentication, which is a bigger problem than you might think. Carrots and sticks, if the institution is serious about this.

So it makes a certain amount of sense to try to tackle this problem as an institution. Where the institutional model falls down, I begin to suspect, is service beyond the bare provision of appropriate technology. Training and handholding. Outreach. Help with data-sustainability plans in grant proposals. Whipping data into shape for the long term. Advice on sustainability, process, documentation, standards—the nuts and bolts of managing data in a particular research enterprise.

Because data and their associated problems are as varied as the research that creates them, I just don't think it's possible to open a single-point-of-service "data curator" office and have that be an effective solution (save perhaps to extremely small, targeted problems like grant proposals). I do still believe that almost any reasonably bright, decently adventurous librarian or IT professional can walk into almost any research situation, get a read on it, and do good things for data. I've seen it happen! But the "getting a read" part takes time and a certain level of immersion. How can a single point of service, whose responsibility is to the entire institution, spend that much effort targeting specific research groups?

Simple. It can't. Moral of the story: data curation is not a Taylorist enterprise.

In practice, I suspect, institutions that create the Office of Data Curation without carefully considering what I just outlined will inexorably wind up serving only a small proportion of the institution's researcher population. It's quite likely to be the proportion of said population swimming in grant money and prestige, of course. The arts, humanities, and qualitative social sciences are most liable to be left hanging. I already see this happening in one or two places I know of: not because they have bad or thoughtless people, not at all, but because good people have been handed an organizational structure ill-suited to the task at hand.

Can such a structure be made workable? Perhaps. It'd take some work from the grassroots. Were I in that situation, I'd be canvassing my campus for every single person on it—librarian, IT pro, grant administrator, researcher, graduate student, whoever—who "does data" in some way. Then I'd be working like crazy to turn them into a community of practice.

I admit I'm a little hazy on how communities of practice form and how they can be encouraged to form; I'm sure there's research on the subject (and would appreciate pointers to same). I must also admit that I've tried multiple times to form one around institutional repositories and quite resoundingly failed.

I can only say based on those failures that much depends on what the community-former has to offer, as well as how ready putative community members are to consider themselves part of a coherent community. In this case, how well would it work? I don't know. I'd want something fairly compelling to offer, to get the ball rolling—perhaps some of those institution-wide resources.

About data fiefs I don't have much to say. They exist already, notably in the quantitative social sciences. They seem to work quite well from a service perspective. Unfortunately, some of their technology practices, especially around data sustainability, set my teeth a bit on edge. Format migration? Audits against bitrot? Standards? Persistent, citable URLs for public data? Not so much, some places. And let us not even discuss what happens when the grant money runs out. These places usually aren't geared for the long term, though they do quite well in the medium term (say, five to twenty-five years) from what I've seen.

If you think I think there's a sweet spot somewhere in the middle here, you know me entirely too well. At least some of the outlines of the ideal state seem clear: where the rubber meets the researcher, local staffing and control; where the problem goes beyond what local can responsibly or effectively manage, the institution steps in. Likewise, the institution has a responsibility to researchers who need data help but can't afford it locally, in their lab or school or department. There should not be coverage gaps.

By the way—there is, in fact, one organization common on research-university campuses that has learned to be (more or less) centralized while still providing discipline-aware, often discipline-specific, services. It does rather remarkable work serving all campus disciplines, as fairly and skillfully as an unjust world permits. A way out of the Taylorist paradox, perhaps!

What is this wonder organization? It's called "the library."

Tidbits, 22 December 2009

Dec 22 2009 Published under Tidbits

Every time I do a tidbits post, I think to myself, "gosh, that was a lot of tidbits; I'll never fill up the queue again."

Every time, I'm wrong.

Happy holidays to those celebrating them.

Want to help me collect tidbits? Tag them "trogool" on, or leave a comment in a tidbits post. Like, er, this one.

Authority control, then and now

Dec 18 2009 Published under Uncategorized

Since the end of the year is a fairly quiet time for my particular professional niche, I've taken the opportunity to do some basic name authority control on author name-strings in the repository.

Some basic what on what, now? Welcome back to my series on library information management and jargon.

The problem is simple to understand. Consider me as an author. I took my husband's surname upon marriage; fortunately, I hadn't published anything previously, but I might have done—and if I had, how would you go about finding everything I've written, if it was published under two different names? "Dorothea" is a fairly distinctive given name, especially in my age cohort, but I do share it with other creators.

Now consider creators whose names are not written in Roman characters. The many and varied romanizations of the composer Tchaikovsky may give pause, though my personal favorite example is a certain Libyan leader who wrote a book or two. (Click over and then hit the plus beside "400's: Alternate Name Forms.")

Libraries confronted this problem when the search technology of choice was the card catalogue. The outline of a solution emerges: to avoid wasteful duplication of cards, all the cards representing titles by a given author should be in one place under one name, but it should also be possible to pop in a single card for each additional name variant so that searchers know which variant is hiding the good stuff. ("Chaikowsky, Peter Ilich: see Tchaikovski, Piotr Ilyich, 1840-1893.")

This means choosing a preferred name variant, of course. Ideally, we'd like this to be consistent across libraries, so that the devotee of Russian music who learns the preferred variant in her home library will easily find what she needs at any other library.

There are additional wrinkles as well: it does happen that different authors wind up with the same name, and for library purposes, that's no good. My husband David, for example, shares his name with a book-writing swimming coach. Libraries chose to use birth years—and, only if necessary, death years—to disambiguate.

Aha, you say. This is why not all author names in library catalogues have attached dates. This is why not all authors with listed birth dates have death dates, even when they'd have to be older than Methuselah to be living still. Yes, this is why. Dates in author headings started strictly as a disambiguation measure; the swim coach didn't have his birth year beside his name until my husband turned up and wrote a book. Of late, there have been raucous arguments among cataloguers in libraryland about adding death dates as a matter of course.

All of this activity—choosing preferred name variants such that each name listing remains unique, listing other name variants with the preferred, organizing by-author displays accordingly, coping with name changes—is called "name authority control." (It has an analogue for subject work, sensibly enough called "subject authority control." This verges on the topic of controlled vocabularies, which is definitely one for another post. Or six.) For catalogue cards, this solution is remarkably elegant and entirely functional. For computer-based record management—well.
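To make the moving parts concrete, here is a toy sketch of an authority file: one preferred heading per person, "see" references from variant names, and birth years added only when two people share a name. Everything in it (the class, its methods, the people, the "Name, year-" heading style) is invented for illustration and is far simpler than real cataloguing rules.

```python
class AuthorityFile:
    """Toy name authority file; not a model of any real library system."""

    def __init__(self):
        # Each person: preferred name, birth year, and known name variants.
        self.people = []

    def add_person(self, name, birth_year, variants=()):
        self.people.append(
            {"name": name, "birth_year": birth_year, "variants": list(variants)}
        )

    def heading(self, person):
        """Preferred heading; the birth year appears only if the bare name is shared."""
        shared = sum(1 for p in self.people if p["name"] == person["name"]) > 1
        if shared:
            return f'{person["name"]}, {person["birth_year"]}-'
        return person["name"]

    def lookup(self, name):
        """Resolve a searched name to preferred heading(s), following 'see' references."""
        return [
            self.heading(p)
            for p in self.people
            if name == p["name"] or name in p["variants"]
        ]

# Hypothetical people, invented for illustration.
authority = AuthorityFile()
authority.add_person("Doe, John", 1950)
before = authority.lookup("Doe, John")        # one John Doe: no date needed
authority.add_person("Doe, John", 1970)
after = authority.lookup("Doe, John")         # two John Does: both get dates
authority.add_person("Roe, Maria", 1965, variants=["Doe, Maria"])
see_ref = authority.lookup("Doe, Maria")      # variant points at the preferred form
```

Note what happens to the first John Doe: his heading changes the moment a namesake turns up. Hold that thought.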

Relational-database experts are howling right now, at the idea that a primary key—what's used to identify a particular row of information, a particular item, in a database—would ever change. The whole point of a primary key is its immutability! Ask for record number 91346342, always get the same record. You never, ever, ever change that record ID. Ever. Really, not ever. If a particle of information can change, it shouldn't be used as a primary key!
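What the database people would prefer looks something like the sketch below: an opaque record number that never changes, with the name heading demoted to an ordinary, editable field. The class, method names, and sample data are hypothetical, made up for this illustration only.

```python
import itertools

class AuthorityRecords:
    """Toy store keyed by an immutable record ID; headings are just editable data."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._headings = {}   # record_id -> current preferred heading

    def create(self, heading):
        record_id = next(self._ids)   # assigned once, never reassigned
        self._headings[record_id] = heading
        return record_id

    def revise_heading(self, record_id, new_heading):
        # The heading may change as often as cataloguers like; the key does not.
        self._headings[record_id] = new_heading

    def heading(self, record_id):
        return self._headings[record_id]

# Hypothetical data, invented for illustration.
records = AuthorityRecords()
coach_id = records.create("Doe, John")
works = {"A Book About Swimming": coach_id}   # works point at the ID, not the name

# A namesake publishes, so the heading gains a birth year.
records.revise_heading(coach_id, "Doe, John, 1950-")
current = records.heading(works["A Book About Swimming"])
```

The book stays attached to the right person through the heading change because nothing ever pointed at the name string in the first place. With a name string as the key, every linked record would have needed updating.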

Linked-data experts are howling as well: why don't all these people have URIs? (If you remember your analogies from the SAT, database:primary key::RDF:URI. Roughly, anyway.) Well, they do, now, thanks to VIAF. Here's my VIAF URI (no, I have no idea why my birth year is included in my authority string, as my name by itself is unique in authority data; ask a cataloguer) to look at. Feel free to hunt for your own URI.

To some librarians, all this business of immutable identifiers may sound like specious wrangling, but it's not: it's actually a major disjunction among cataloguing practice, the databases underlying ILSes, and the perennially-emerging world of linked-data mashups via RDF. Inexpert programmer that I am, the idea of programming around library methods of authority control makes my head hurt. It leads to real problems making online catalogues work well (never mind library systems that aren't tied into authority control, such as digital-library platforms and institutional repositories), and making library data play nicely with other people's data. When gearhead librarians and other technologists say "library data is siloed," this is exactly the sort of thing they mean.

You may, particularly if you are a hard scientist, have noticed another hole in this system: you don't get into it unless you have written a book. (Exceptions, yes, for editors and composers and book illustrators and whatnot. However.) I, for example, had two or three articles and book chapters come out before co-authoring a book published in 2008. I didn't have an authority record until the book was catalogued. If all you've published are articles, you don't have an authority record, sorry.

This is becoming a serious problem! If it were just people like me struggling with it, that wouldn't signify; as a librarian, I'm supposed to struggle with this sort of thing. I learned hotshot DIALOG-searching tricks in library school to get around article databases' lack of name authority control, for instance. Right now, I've built up a strategy for finding physicists' and engineers' first names that mostly works, though I do wish whatever weird graduate-school midnight hazing ceremony that deprives these worthy people of their given names in favor of their initials would wither away and die. (I am joking. Mostly. This phenomenon, though of course it isn't the result of hazing, can be maddeningly difficult to rectify, especially when the author in question is a graduate student who either doesn't graduate or doesn't go on to an academic career.)

No, the real problem concerns the changing nature of performance measurement in academia, mostly in the sciences to date. As journal impact factors wane in importance (not nearly fast enough for me!), the importance of measuring the impact of individual articles and other publications via citations and download counts rises. How are we to measure this anything like correctly for a given author if we can't reliably match articles to authors?

In an article published earlier this year, I wrote that there was a ferment of activity around the question of author authority, and what would come of it all was far from clear. I'm happy to say that clarity is emerging, in the form of ORCID: the Open Researcher and Contributor ID initiative. This effort looks to me to have critical mass and brainpower to make a difference: publishers, libraries, technologists, and research funders are all involved.

In the meantime, I plod through the repo's author listings, making what minimal order I may, very desirous of a better solution.

"Just print it!"

Dec 14 2009 Published under Praxis

A common response, including in the comments at Book of Trogool, to raising digital-preservation issues is a chortle of "Guess print doesn't seem so bad now! Let's just print everything out, and then we'll be fine!"

Leaving aside my own visceral irritation at that rather rude and dismissive response—no, we won't. "Just print it out" doesn't stand up to a moment's scrutiny. Let us scrutinize a moment, shall we?

Problem number one is the variety of digital materials that become useless the instant they are printed, or cannot be "printed" at all. Hypertext. High-resolution imaging, as from microscopy or any number of other digital-imaging processes. Endless columns of numeric data. Source code. Games. Et cetera.

Problem number two is the sheer volume of output we're talking about. You tell me how much paper and ink it would take to print out a night's worth of astronomy observation data from a single telescope, or an entire time-series of microscopic cell observations in 3-D. It's not even remotely feasible.

Problem number three is storage space. Think libraries or archives can take that volume of paper? Think twice. Every research library I personally have any data on (and I have a fair professional network, plus I do my professional reading religiously) is bursting at the seams with physical materials already. Raising the incoming volume by a power of—well, quite a large number, really—is not on.

Problem number four is organization. You think the piles of paper on your desk are bad now? Digital metadata scares you? You ain't seen nothin' yet. (Which reminds me, I need to get back to my discussion of library standards and practices here. I will do that.)

Problem number five is discovery. Who's going to know what data have been printed, where they are, and how to obtain them? If you hate your local library's online catalogue (bias disclosure: I'm not particularly fond of any library online catalogues, though some are less bad than others), imagine it as the sole source of information about datasets.

Problem number six is delivery logistics. Someone wants to work with your data. Will you FedEx them the printouts? (If you can find them; see problem number four.) Your originals, or a photocopy? Who makes the photocopy? Who pays for all this? How?

I understand the impulse to retreat to a form of knowledge management that seems comfortable, safe, familiar, and easy. I do. I will also point out, though, that "easy" is in the eye of the beholder: there is an immense resource and skill scaffold underlying analog preservation already, in libraries and archives and museums. That it's invisible to most people—ever visited a book conservation lab? or a bindery? or a microfilming center? or a storage vault? I recommend it; they're fascinating operations—doesn't mean it's not there.

We need similar scaffolding for digital preservation. We don't have it yet. That doesn't mean it's impossible to construct, nor does it mean we should or even can retreat to a print-only world.

So, please, let's stop pretending that's a possibility. Further comments along these lines at Book of Trogool may, depending on how I feel that day, be quietly ignored, ruthlessly deleted, or mercilessly mocked.

Tidbits, 9 December 2009

Dec 09 2009 Published under Tidbits

I'm at home today owing to last night's epic snowfall in Madison shutting down practically the entire university, so it's time for tidbits!

Regarding that last one, would it be helpful for me to try to maintain a jobs roundup here? If you think so, drop me a comment. I'd also appreciate pointers to good places to spot such jobs. I know most of the library sources, but based on this poster helpfully pointed out to me by commenter Nic Weber, a lot of the job ads will go out in science venues.

Here's hoping my choir's dress rehearsal scheduled for tonight can actually happen… in the meantime, I raise my hot-chocolate mug to you all.

Avoiding roach motels

Dec 08 2009 Published under Praxis

The latest issue of the International Journal of Digital Curation is out; if you're in this space and not at least watching the RSS feed for this journal, you should be.

I was scanning this article on Georgia Tech's libraries' development of a data-curation program when I ran across a real jaw-dropper:

One of the bioscientists asked the data storage firm used by one of the labs recently about the costs associated with accessing data from studies conducted a few years ago. The company replied, "you wouldn’t want to pay us to do that. It would be less expensive to re-run your experiments." (p. 88)

Ouch. The immediate question springing to mind is "why is this lab paying these people to store data if the stored data then become unretrievable by the original depositors?" Roach motel: data goes in, but it doesn't come out!

It seems to me that the lesson here is making even seemingly-obvious requirements explicit when expensive service provision is in play. You would think that retrieval is an automatic concomitant of storage. I sure would. Apparently not!

I've run into similar problems before, but in the best story I have on the subject, the problem was format-related. I retell the story in order to warn people to be wary of hotshot black-box content-management systems.

I once worked for a scholarly-publishing service bureau. The company did editorial work, typesetting, art, design, and SGML/XML-based workflows, which is the division I was in. So one of our SGML clients was shopping for a content-management system to manage and archive all their publishing material. Sensible enough. They specifically asked each vendor whether they could retrieve the same SGML from this system that they'd put into it.

The vendor they eventually contracted with assured them that they could. Not to put too fine a point on it, the vendor was telling untruths. The SGML was munged on ingest into whatever unholy lossy proprietary mess the CMS used natively, and could not be retrieved intact therefrom. Our client didn't find this out until after the purchase, of course. There was talk of lawsuits; I don't recall where that went.

Slight happy ending: our shop had its own project-archiving procedures, so the client didn't lose any SGML that we had provided them with.

Don't let any of this happen to you! Ask questions that seem stupid, and make your counterpart commit to the answers you want.

What is the impact of discovery tools on researcher self-archiving behavior?

Dec 07 2009 Published under Tactics

This is the question I was asking myself while reading this fairly straightforward paper on open access in high-energy physics (hat tip to Garret McMahon).

It's impossible to be in my particular professional specialty and not know about the trajectory of self-archiving in high-energy physics, but I learned a smallish detail from that paper that intrigues me rather: the existence of SPIRES, a disciplinary search tool that covers both the published literature and gray literature such as preprints on arXiv.

This strikes me as a rare thing. We have disciplinary gray-lit search tools such as RePEc in economics, and we have no end of disciplinary published-lit search tools (despite the considerable expense of securing access to them), but tools that do both? Within a given discipline? I'm not a reference librarian, so discipline-specific search tools aren't my specialty at all, but I can't think of anything else on the SPIRES model. There's WorldCat and Google Scholar, of course, but neither of them is discipline-specific. EBSCO is known to index some library blogs for its library-science databases, but they don't touch DList or E-LIS as far as I'm aware. Law might have some interesting things going on, given the novel importance of blawgs, but I don't know of anything firsthand.

SPIRES makes me wonder, it really does. Imagine you're a high-energy physicist (take that in either sense or both!). You search SPIRES; you know all your colleagues do, too. You have two ways to get your work in SPIRES so that it's in front of their eyes: pop a preprint on arXiv, or go through the slow process of peer-reviewed publishing, a process that you don't believe will change your paper much.

This is not the narrative that one typically sees regarding high-energy physics and self-archiving. It's usually seen as a continuation of a print-culture norm of circulating preprints individually by mail. Still… I wonder.

What is the relevance of this little idyll to research data? This: If data are not indexed where researchers expect to search for disciplinary materials useful to them, will data be used? Taken seriously? Cleaned up and placed online in the first place, even? "Discoverability" of data, in the broad sense of "availability to web search," may not be enough. Discoverability through discipline-appropriate channels, alongside other trusted materials, may well be the key.

Or so it seems to me.

Training? Or jobs?

Dec 04 2009 Published under Praxis

There have been a number of piercing calls for training of data professionals (of various stripes) in the last year or so. Schools of information have been answering: Illinois, North Carolina, others.

Honestly, I'm getting a sinking feeling in my stomach. If I were to label it, the label would go something like "where are all these newly-minted data professionals going to work?"

My stomach sinks worse when I realize that quite a few of the calls are coming from the same people and organizations who uttered piercing calls for the establishment of institutional repositories in the early 2000s. Libraries did as they were bid; the results were at best mediocre (and that's a generous assessment). The callers have not, to the best of my knowledge and belief, acknowledged any error in the call they made, much less any of the waste and damage caused. So… we're going to trust these same people on a similar leap into the half-known?

The larger question is how we move data professionals into the research enterprise. It's an analogous question to others that have surfaced in libraries: moving librarians into the classroom to teach more than Booleans, for example. We'll hear some of the same things from the people we want to help: "a solution in search of a problem," notably, as well as "how can you possibly understand my research if you're not just like me?"

(My answer is what it's always been: "I don't have to understand your specific data to tell you that keeping data on CD-ROMs in a shoebox under your desk is a bad idea.")

I've seen one answer I like: internships. GSLIS at Illinois moves its data-curation students into data-related internships once they graduate. They beat the bushes for research organizations looking for the kind of help their graduates provide. In so doing, they ease their people into jobs, raise the profile of their program, and raise the profile of information professionals as research partners generally. This is smart business. I go further: I believe it wholly irresponsible to have a data-curation instruction program targeted at librarians and information professionals without such an internship program.

Training scientists is another question, of course; I don't think it's quite as necessary to do internships in the well-accepted informatics fields. It probably can't hurt, though.

Grant funders: I'd like to see some bribes happening. Make money available to grantees to hire on data professionals. The wording of such grants will be tricky—you don't want them hiring just another developer—but I'm sure you can do it. Likewise, fund the internships I just described! Finally, any research you can fund that demonstrates good outcomes from the presence of data professionals can only help.

Institutions: I don't know; I truly don't. Some days I believe that data management can only happen on the level of the individual research lab. Some days I believe that data can only survive if institutions tackle the problem. Some days I believe both, and my head hurts.

We all of us need to avoid some obvious pitfalls, however. The maverick-manager pitfall familiar to libraries from the IR disappointment is one: data curation for an entire research institution cannot become the exclusive purview of one or a handful of supposed data professionals, especially when they have no budget, no developers, one server at most, and no institutional network.

Flooding the job market is another. Data professionals will lose what little credibility among researchers we have if dozens of us wind up applying to every open job. That leads to perhaps the shortest road to deprofessionalization in history! Let's not do it. One way to avoid it may be to bite the bullet about incoming qualifications: perhaps we need to sigh and say "no science BA, no enrollment in this program; MAs and Ph.D.s in science preferred."

That slams the door on me, incidentally, and I wouldn't be happy about that. But if it means that newly-minted professionals have obvious job-market value, then that's what we have to do.

Finally, let's not get quite so exercised yet about who does what work; we risk "I stubbed my toe! Call in a specialist!" syndrome. Let's focus on the work to be done. Work has a marvelous way of getting done, when it has to be, even by people who aren't "professionals" and don't have "professional" training. I am not a professional programmer. I don't have the least hint of a degree in computer science, software engineering, or anything else. I still write code, because the code won't write itself. If similar processes are how data curation turns out to happen, that's fine with me.

Not least because then I won't have "professional" doors slammed in my face.
