Archive for the 'Praxis' category

Christine Borgman on data

Sep 14 2010 Published under Praxis, Tidbits

Christine Borgman has a lengthy track record of saying smart and apposite things about scholarly communication and research data. (See my review of her 2007 book here.)

She has done it again, in a conference paper entitled "Research Data: Who will share what, with whom, when, and why?" If you liked my Ariadne article at all, you will love this, I guarantee it. Strongly recommended, so much so that I didn't want to wait for the next tidbits post.

Institutional repositories and digital preservation

Sep 07 2010 Published under Open Access, Praxis

With all the pressing issues the open-access movement has to deal with, I honestly don't understand why we scrap over digital preservation. But scrap we darned well do, so I'll toss my two coppers in the pot.

They amount to this. Digital preservation is not a single thing one does or doesn't do; it's a whole constellation of things, some of which matter more than others. By and large, considering real-world threats instead of playing digital security theatre, institutional repositories do fairly well at digital preservation. They could (and would) do better if institutional-repository software integrated better with file-format analysis tools.

I have no patience for "it's about open access, not digital preservation!" arguments. There is no access, open or otherwise, without at least basic preservation steps. We can see this principle in action, even: the disappearance of DList (the US library and information science repository) and Mana'o (a disciplinary repository for anthropology) removed quite a bit of material from the public eye.

Likewise, I have no patience for thinking of digital preservation solely in terms of technology. DList and Mana'o are the biggest, most glaring examples of access failure in the repository realm. (We don't actually know whether they were full-on preservation failures; the content may still exist out of sight somewhere. Or it may not, in which case we indeed have a failure of preservation.) In both cases the failure had nothing to do with technology: it was organizational and business-model failure. Both DList and Mana'o started as single-person projects. Neither made adequate contingency plans for the obvious risks of letting repository survival depend on a single person. The single person ran into time and energy limits. Nobody picked up the slack. The repositories died. QED.

(Think it can't happen to you? Ask yourself what would have happened to arXiv when Ginsparg got tired of it if Cornell University Libraries hadn't white-knightly charged in. I think it would have died too, myself.)

So if the major observed risk to content preservation is failure of organizational support, IRs hold up pretty well. I've been quite caustic in my time about institutions' and libraries' failure to support IRs adequately (and sadly, I have another acerbic article brewing in the back of my head) but I will happily say that I've never seen or heard of an IR whose sponsors weren't aware that they were taking on a serious obligation to the content they collect. Score one—and it's a big one—for the humble IR.

Regarding technology-specific threats, most IRs are far from perfect, but they're a good deal better than nothing. DSpace IRs, for example, do checksums on everything they ingest, and those checksums can be regularly audited. Assuming halfway-decent backup behavior (and yes, this is an assumption), this reduces bitrot danger to near zero. File-format obsolescence is often remarked upon as a problem, and it is true that IR software does not do all it should with tools like JHOVE designed to evaluate file formats and point out problematic files. Frankly, though, I'm with David Rosenthal and Chris Rusbridge on this one: mass-market file formats such as most IRs contain rarely become completely unreadable. Information loss, sure (fonts and formulas particularly), but not often and not much.
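The checksum-and-audit routine described above can be sketched in a few lines. This is not DSpace's own code, only a minimal illustration of the idea: the choice of SHA-256 and the names `sha256_of`, `record_checksums`, and `audit` are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_checksums(folder: Path) -> dict[str, str]:
    """At ingest: map each file's relative path to its checksum."""
    return {
        str(p.relative_to(folder)): sha256_of(p)
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    }


def audit(folder: Path, manifest: dict[str, str]) -> list[str]:
    """Later: list files that are missing or no longer match the manifest."""
    problems = []
    for name, expected in manifest.items():
        target = folder / name
        if not target.is_file() or sha256_of(target) != expected:
            problems.append(name)
    return problems
```

Anything `audit` reports would then be restored from backup, which is why the backup assumption matters as much as the checksums do.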

IRs could also stand to do better at geographic replication of their contents… but once again, this is an organizational issue, not a technology one. It's been addressed in a few cases, so we know pretty well how to do it; our organizations just aren't stepping up yet. I think the Duraspace cloud efforts have brought this question to the front burner, and I expect matters to improve within the next year or two.

Finally, an oft-forgotten part of the IR preservation strategy is the human beings behind IRs. By way of example, I've adopted a few websites into the IRs that I've run. Before they go in, I check internal links, I remove unnecessary Flash (practically all of it, that is) with extreme prejudice, and I clean up unnecessarily nasty HTML. I'm pretty confident those sites will do all right for quite a long time because of my interventions.

So can we stop arguing about digital preservation now, please? Plenty more productive arguments we could be having.

Escaping Datageddon: comments, please

Aug 31 2010 Published under Praxis

I'm due to give an introductory talk on data management to a group of graduate students later this fall. Since I like to steal from the best, I cribbed heavily from MIT's most excellent guide on the subject, particularly their slidedeck, but I thought I could perhaps improve a bit on that deck's organization, as well as cut down some on the information firehose without losing the main points.

I consider this still under construction, so feedback is most welcome.

Escaping Datageddon

Avoiding embarrassment via open data

Aug 11 2010 Published under Praxis

Melody has a fantastic post on the Marc Hauser cooking-the-data-books scandal. I won't even recap; just go read it.

Individual researchers' cavils aside, science has a fairly compelling reason to push for open (or at least opener) data: turning away the Hausers of this world before they start gaming the system, as well as catching them before they turn into immense embarrassments. How much hay gets made over these scandals by anti-intellectual science-haters? "Any" is too much, but I'm guessing it's a lot.

No, of course open data is not a panacea, and human-subjects data particularly is sensitive and hard to make open—but even "openness to reviewers" can only help.

Introduction - the Honor System

Jul 14 2010 Published under Miscellanea, Praxis

As a new blogger here at Book of Trogool I'd like to thank Dorothea for the opportunity to share in the discussion of evolving issues in technology, libraries, research, and scholarly communication.

I'm currently the Scholarly Communications and Library Grants Officer at Binghamton University, in upstate New York. I've been a librarian for some time (12 years now) and before that I was a chemist, with research experience in inorganic photochemistry, surface science reaction dynamics, and equine drug detection and quantification methods. While I did different experiments in each lab, each place was surprisingly similar in its culture and practice, and it was this lack of creativity in the research process that drove me to librarianship, although many of the projects themselves were interesting and insightful.

While I'll share some ideas from my library experiences here, in my current role I frequently find myself going back to my roots, so to speak, to understand and share the challenges of these emerging tools, behaviors, and systems. I understand things best by analogy and metaphor, and I think to better understand a new or changing culture you can find a lot of the answers from the past.

When I started as a serious (i.e. college) student, I attended and graduated from the University of Virginia. In addition to having its own well-defined culture, nomenclature, and social environment, it also had the Honor Code and System. Anyone familiar with honor systems knows the essence of the system is trust. UVa's system is pretty unique in that it has been entirely student-run since its inception in the 1800s, and only your peers could accuse and convict you of cheating. The system also had a single-sanction rule, so one offense and that was it. You were gone.

This system was not without some peculiarities. If a homework assignment was pledged, as we called it, you couldn't work with anyone. As a science major I could never ask a classmate for help with my assignments and lab reports, so I could never collaborate on anything or learn from my peers. I also never got a final exam returned to me, so I never knew what I didn't learn from a course. In practice it isolated and sequestered knowledge and information.

This single-sanction system is a lot like the traditional publishing environment. Research output is carefully controlled and hidden prior to publication - no one can see the research until the final paper is published. If you go outside "the rules," just like the single sanction, your credibility can be challenged and your reputation can suffer. Just like the honor system, once you're out it's permanent. As a result there is little incentive to innovate with new methods of communication technology or produce output that is not recognized by the honor system.

So while Honor sounds great in theory, it's not so useful in practice. I use the capital because academia still abides by this principle. You see it in the tenure and promotion decisions and the way campus business and policy is conducted.

Another characteristic of honor is that it is very personal. And this is another disconnect I see with how traditional research is happening today. The publishers feel they are bestowing honor on the researchers by accepting and publishing their manuscript, and the researchers feel their research output and projects are giving honor and prestige to the publication. And this, I think, is where the challenge lies - to convince each group that their honor resides within themselves, and isn't transferred between one or the other in order to become legitimate. Never once as a student did I think the University made me honorable, or gave me honor by being there; I demonstrated honor by my actions and behavior.

Can this be changed? I hope so, because the culture can't continue in its current state. I hope to explore the issues I encounter in discussions with faculty, students, researchers, administrators, and policy makers, and provide advice and strategies to effect positive change. I'll also try to explain some of the oddities of library culture, specifically academic library culture, which can be perplexing to anyone not immersed in this environment.

Promoting a comment: "Open and shared format"

Jul 09 2010 Published under Praxis

Richard Wallis has taken my ribbing in good part, which I appreciate; his response is here and will reward your perusal.

He also left a comment here, part of which I will make bold to reproduce:

As to RDF underpinning the Linked Data Web - it is only as necessary as HTML was to the growth of the Web itself. Documents were being posted on the Internet in all sorts of formats well before Tim Berners-Lee introduced us to the open and shared HTML format which facilitated the exponential growth of the Web. Some of the above comments are very reminiscent of the "why do I need to use HTML" discussions from the mid 1990's.

It is an open and shared format, such as RDF, that will power the exponential growth of the Linked Data web, but the conversations around it are still at the equivalent of 1995 stage.

If I read this right, Richard is not actually saying that the web is all HTML and therefore HTML is Good and All Web Things Must Be HTML. That's good, because that would be a silly thing to say. The web I use has plenty of CSS and Javascript and XML and JSON and JPEGs and PNGs and Flash (gah) and PDF (double gah) and other stuff on it.

What Richard is saying (again, as I read it) is more subtle: widespread growth of the data web requires an open standard to cut through the Babel of competing and closed formats the same way that HTML cut through the Babel of document formats, because without that standard, interoperability is too much effort and so no one realizes the benefits.

Richard is welcome to check my understanding; I may have this completely wrong. Nonetheless, I don't believe a word of it, and I especially don't believe it if RDF is the HTML analogue (which, let's be clear, Richard very carefully did not say). Here's why I don't.

First, HTML was hardly the only part of the web stack necessary to its explosion. TCP/IP, anyone? Moreover, HTML by itself is obviously insufficient as the driver of that explosion, or we'd all still be on Gopher (remember Gopher?). Formatted strings of words are not all we monkeys interact with. Neither are assertions, about documents or anything else. (The whole thing about "not all data are assertions" seems to escape some of the die-hardiest RDF devotees. I keep telling them to express Hamlet in RDF and then we can talk.)

Second, I don't know that we need to rely on a single data format for interoperability. It's not impossible, but remains to be proven. The data web that I personally think is more likely closely resembles today's mashup and microformats cultures: lots of formats with suitable documentation (one hopes) and APIs, available for use by whoever's willing to suss out how the various datasets work and write code to glue them together. It's a rough-and-ready sort of interoperability, arguably an inefficient one, but eppur si muove, as Galileo did not say of the web.
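A minimal sketch of the glue code that paragraph has in mind, assuming two small datasets published in different formats. The sample data, the field names (`station_id`, `station`, `temp_c`), and the `merge` function are all hypothetical, invented only to show the pattern.

```python
import csv
import io
import json

# Hypothetical sample data: one dataset published as CSV, another as JSON.
STATIONS_CSV = "station_id,name\nS1,Lakeside\nS2,Hilltop\n"
READINGS_JSON = '[{"station": "S1", "temp_c": 21.5}, {"station": "S2", "temp_c": 17.0}]'


def merge(stations_csv: str, readings_json: str) -> list[dict]:
    """Join the two datasets on the station identifier."""
    # Each format is read with its own parser; no shared data model is needed.
    names = {
        row["station_id"]: row["name"]
        for row in csv.DictReader(io.StringIO(stations_csv))
    }
    # Readings for stations missing from the CSV are silently skipped.
    return [
        {"name": names[r["station"]], "temp_c": r["temp_c"]}
        for r in json.loads(readings_json)
        if r["station"] in names
    ]


print(merge(STATIONS_CSV, READINGS_JSON))
```

The cost is visible in the code: whoever writes `merge` has to read both datasets' documentation to learn that `station_id` and `station` mean the same thing. That is the inefficiency the rough-and-ready approach accepts.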

Third, I'm not entirely convinced we need to rely on interoperability and its network effects as our incentive toward data-sharing. Tim BL certainly did; there wasn't much technical precedent for what he was up to. But we have the web already, a cogent argument if ever there was one. We also have governments, grant agencies, and businesses wanting to multiply return on investment in data. RDF seems downright small-potatoes by comparison, as incentives go.

Finally, the HTML:RDF analogy falls down in one area that I think is utterly crucial: ease of adoption. I can teach enough HTML (and CSS) to be going on with in a couple of hours; I've done it. I still touch RDF only with great fear and loathing and a constant sensation that I must be doing it wrong, and I'll teach it only when I absolutely must and with a great many "I don't pretend to understand this" disclaimers. You can't frighten me with XML namespaces, XPath, XSLT, or regexes, but RDF scares me stiff. This is not an open standard that's going to rule the world. Not today, not tomorrow, and in my opinion not ever.

There's another danger lurking in the one-format-to-rule-them-all argument, a danger I hinted at above: what happens to data that for whatever reason aren't expressible in the format of choice? Second-class citizens? Invisible? I hope not.

Anyway, I say again: if the data web depends on RDF, the data web is a pipe dream and we should look for something else to do. I'd much rather believe the "if" clause counterfactual.

I'd love to dance with you, but...

Jul 06 2010 Published under Praxis

Richard Wallis of Talis (a library-systems vendor) posted The Data Publishing Three-Step to the Talis blog recently.

My reaction to this particular brand of reductionism is… shall we say, impolitic. I just want to pat Richard on the head and croon "Who's the clever boy, then? You are! Yes, you are!" This is terrible of me, no question about it, and I apologize unreservedly.

Here's the problem, though. Aside from my friends the open scientists (and not even all of them, to be honest), practically all the data-producing researchers I know are firmly stuck on Step 1. Firmly stuck, not to say "immovably." As for Step 2… trust me, these folks are not data modellers. I sincerely doubt my own capacity to teach RDF to someone who approaches me asking, "Is it okay if I record my data in Excel?"

Noting that I have been a longtime RDF skeptic so that you all can discount my peculiar biases, I will say that this disconnect between Linked Data proponents and Joe Q. Researcher concerns me a great deal, mirroring as it does the prior disconnect between RDF advocates and web programmers and content producers, a disconnect that has thus far prevented RDF from becoming common currency on the web.

The bar is too high, folks. It is too high. For my part, I'm starting somewhere both simpler and more complex: working on convincing people that exposing data in any form, emphatically including Excel, is a worthwhile thing to try.

On NSF data plans

May 11 2010 Published under Praxis

Word on the street is that the NSF is planning to ask all grant applicants to submit data-management plans, possibly (though not certainly) starting this fall.

Fellow SciBlings the Reveres believe this heralds a new era of open data. I'm not so sanguine, at least not yet. Open data may be the eventual goal; I certainly hope it is. At this juncture, though, the NSF would be stupid to issue a blanket demand for it, and I rather suspect the NSF is not stupid.

Part of the problem, of course, is that many disciplinary cultures are simply not ready for even the idea of open data. If the NSF were to mandate it, these disciplines would revolt openly, tossing lots of "government interference in science" rhetoric around. Moreover, disciplines that are hand-in-glove with industry would lead the charge, with industry's big bucks to back them up. I hear quite a lot about industry strongarming academic scientists into considering nearly everything, emphatically including data, a "trade secret."

(Lest anyone think this type of reaction is limited to the sciences, I ask you to recall the kerfuffle at Iowa over electronic theses, spearheaded by the creative-writing department.)

Another part of the problem is that many, perhaps most, scientists who are ready for the idea of open data are emphatically unprepared for its praxis. It's beyond doubt that data management will be extra work for most of these people, given how sloppy and ad-hoc many data practices are; as the NIH Public Access Policy demonstrates, adding to a researcher's workload must be done with extreme circumspection.

The NSF can't hand down guidelines from on high. Blanket "here is how you deal with your data" demands will not work, given the quantity, variability, and variable sensitivity of data across the scientific enterprise. Data standards? Data standards don't exist for the entirety of science (never mind metadata standards), and not even the NSF can wave a magic wand to call them into existence. Rather cleverly, then, the NSF is planning to say "We don't necessarily know how to deal with your data, but we expect you to think about it and do the right thing."

So if you think you might be affected by this rule if it comes to pass, what should you do? Here's what I think.

  • Do not try to revamp every single process and procedure you have. Do not try to "rescue" all your old data all at once. You will swamp yourself and get discouraged. Seriously, don't. Panic won't help you here.
  • Instead, look back at your last funded project, since it will be freshest in your mind. What data did it produce?
  • What happened to that dataset in the course of your research? Did you run programs against it? Be prepared to archive and document that code.
  • Who handled your data? Did they document it? Where? If there is any part of the process you're fuzzy on, be aware that this fuzziness will need to go away for your next project.
  • Ask yourself the famous ten questions (PDF) about your data. The answers will inform your data-management plan.
  • What can't you do for your data that you think should be done? Need partners? Go find them now. Depending on your needs, the right partners may be in your campus library or IT organization, or they may exist at your funder or in a research center near you.

That should keep you out of trouble for a while! It will also mean that you are prepared come the next funding cycle, where many would-be grantees won't be. In today's cutthroat funding environment, that can only help.

Data longa, tractatus brevis

Apr 05 2010 Published under Praxis

Dan Cohen has an extraordinarily worthwhile post recounting his talk at the Shape of Things to Come conference at Virginia (which I kept my eye on via Twitter; it looked like a good 'un).

I see no point in rehashing his post; Dan knows whereof he speaks and expresses himself with a lucidity I can't match. I did want to pick up on one piece toward the end, because it has implications for library and archival systems design:

Christine Madsen has made this weekend the important point that the separation of interface and data makes sustainability models easier to imagine (and suggests a new role for libraries). If art is long and life is short, data is longish and user interfaces are fleeting. Just look at how many digital humanities projects that rely on Flash are about to become useless on millions of iPads.

As I've had occasion to mention, scholars generally and humanists in particular have a terrible habit of chasing the shiny. If Dan's post helps lead to an ethic of "sustainable first, shiny later," I will be a very, very happy camper. (I note that Dan's shop has firsthand experience with losing older projects to the shiny—non-standardized Javascript, if I recall correctly. Dan speaks from a position of hard-earned wisdom!)

The answer to this conundrum is not, however, "avoid the shiny at all costs!" It can't be. That will only turn scholars away from archiving and archivists. To my mind, this means that our systems have to take in the data and make it as easy as possible for scholars to build shiny on top of it. When the shiny tarnishes, as it inevitably will, the data will still be there, for someone else to build something perhaps even shinier.

Mark me well, incidentally: it is unreasonable and unsustainable to expect data archivists to build a whole lot of project-specific shiny stuff. You don't want your data archivists spending their precious development cycles doing that! You want your archivists bothering about machine replacement cycles, geographically-dispersed backups, standards, metadata, access rights, file formats, auditing and repair, and all that good work.

So this implies a fairly sharp separation between the data-management applications under the control of the data archivists, and the shiny userspace applications under the control of the scholars. How many of our systems have, or even imply, such separation?

DSpace doesn't, to my everlasting annoyance. (Try building a userspace application on top of materials in DSpace but wholly outside it. Just try.) Omeka doesn't—sorry, Dan. Not Greenstone, not EPrints, not ContentDM, not any of the EAD systems out there, not DLXS. All of these are built as silos, their APIs somewhat to appallingly limited. I'm here to say, the data silo needs to die, and the sooner the better.

Fedora Commons has this right. I say again: for all its faults, and it has them, Fedora Commons has this piece right. I also like what I see coming out of places like the Library of Congress, the California Digital Library, and the University of North Texas.

But let's stick with Fedora, because it's what I know best. Fedora isn't even trying to be the whole silo; it punts on the userspace problem entirely. It doesn't have a web user interface that anyone other than a command-line addict would recognize. What it has is a reasonably comprehensive (and improving) API on which any number of interfaces can be built.

Since "any number" is the exact number of interfaces that will need to be built (and coexist) over wildly varying data… you see why I think this the right approach. If you want to see this approach in action, you need seek no further than Islandora and its Virtual Research Environments.

Here's the fun bit: it doesn't take the University of Prince Edward Island's developers to create a new VRE. Any Drupal dev willing to learn about Fedora's view of the universe and reverse-engineer some of UPEI's code can do it. That's a fair few devs.
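The separation argued for here, one archive API with any number of disposable interfaces on top, can be sketched as follows. None of this is Fedora's or Islandora's actual API; `Archive`, `InMemoryArchive`, `plain_listing`, and `html_listing` are hypothetical names used purely for illustration.

```python
import html
from typing import Protocol


class Archive(Protocol):
    """The only thing interfaces may rely on: a small, stable API."""

    def list_ids(self) -> list[str]: ...

    def get(self, object_id: str) -> dict: ...


class InMemoryArchive:
    """Stand-in for the archivists' side (storage, audits, backups)."""

    def __init__(self, objects: dict[str, dict]) -> None:
        self._objects = dict(objects)

    def list_ids(self) -> list[str]:
        return sorted(self._objects)

    def get(self, object_id: str) -> dict:
        # Return a copy so interfaces cannot alter the archived record.
        return dict(self._objects[object_id])


def plain_listing(archive: Archive) -> str:
    """One disposable interface: a plain-text list of titles."""
    return "\n".join(archive.get(i)["title"] for i in archive.list_ids())


def html_listing(archive: Archive) -> str:
    """Another interface, built independently on the same API."""
    items = "".join(
        f"<li>{html.escape(archive.get(i)['title'])}</li>"
        for i in archive.list_ids()
    )
    return f"<ul>{items}</ul>"
```

Either listing function can be thrown away and rewritten without touching the archive, and the archive can change its storage without breaking either one, so long as `list_ids` and `get` keep their meaning.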

And that's the way the world will have to be. Data longa, tractatus brevis.

Societies and science

Mar 17 2010 Published under Praxis

John Dupuis asks some provocative questions; I thought I'd take a stab at answering them, and I encourage fellow SciBlings to do likewise.

I quite agree with John when he says that the ferment over publishing models disguises a larger question, "the role of scholarly and professional societies in a changing publishing and social networking landscape." My own history with professional societies, I think, bears this out nicely.

John asks first: What societies do you belong to?

I belong to the American Society for Information Science and Technology. I was a member of the American Library Association for a time as a library-school student, until unchallenged racist statements from its then-president Michael Gorman made me reconsider ALA's value proposition; I wound up dropping the membership.

I am also a member/supporter of the Creative Commons and the Electronic Frontier Foundation (which reminds me that it's about time I kicked another donation over to the latter). These aren't scholarly or professional societies in the sense John means, but I invite you to consider two things. One is, of course, that professional societies are competing with advocacy groups like CC and EFF for my money, attention, and time. The second: an often-rehearsed refrain justifying joining ALA in particular is the lobbyists that ALA sponsors in Washington, and the other advocacy and education work that ALA does.

I'm not knocking that work. In fact, if I could donate to ALA's Office of Information Technology Policy (makers of the highly useful Copyright Slider, among other things) and be assured that every penny of my donation would go to OITP's work, I would gladly do that. I'm happy to support advocacy I believe in. I just want to do it without having to support ALA per se, which I don't particularly believe in as presently constituted.

Next question: What value do you get from your membership?

For a while, I had a pretty good streak going of one ASIST-sponsored conference per year. That streak ended last year, but it's as likely as not to pick up again; of the major library and info-sci organizations, the likeliest one to sponsor a conference I'm interested in (and thus cut me a break on conference fees) is ASIST. (ACM is competitive in this regard, but they lost any chance of hooking me when they played games with Harvard over its OA policy. You can stop sending me marketing materials now, ACM. You lose.)

There is also professional-identity value in an ASIST membership. It's a signifier; it signals not only that I'm serious about my profession, but what elements of the profession I'm serious about. Not a few librarians belong to ALA and some of its subsidiary organizations for similar reasons.

Value I don't get from ASIST includes professional-networking value; I do just fine for myself on the interwebs. Because I'm not tenure-track, I also don't have service obligations required of me. If I did, ASIST would unquestionably be the outlet for my labor. Again, the need to demonstrate national-level service is a motivation for many academic librarians who are tenure-track.

I'm also not particularly invested in ASIST's publications. JASIST contains eggheadery on a level I simply can't rise to, and the Bulletin isn't in my experience terribly interesting.

Third question: Is how you're thinking about your membership and the society's role in your professional life changing?

Not noticeably, but I haven't been in the profession all that long, so it hasn't had much time to change, has it? I will say that I expect personal value out of my ASIST membership that I don't expect from CC and EFF. All I expect CC and EFF to do is keep on keepin' on with their missions, without wasting money (which they don't) or creating huge mission-unrelated scandals (which they haven't). At such time as the signifier value of an ASIST membership drops significantly for me, that membership may be in trouble.

Does this mean that scholarly/professional societies need to think harder about what they do instead of what they are? Quite possibly. Instead of esse quam videri (yes, I grew up in North Carolina), facere quam esse. I'm happier to throw money at doing than at being.

John saves the best for last: Do you think societies should be in the scholarly publishing business?

Oof. That's a loaded question, because it's different from the question should scholarly societies publish journals? I am on record as saying that societies have no particular right to fund their non-publishing activities from their publishing activities at the expense of library budgets. I still believe that.

Still, scholarly societies are in a good place to mobilize much of the labor that underpins journal publishing. The authors, peer reviewers, and acquisitions editors pretty much come to them! (Per this just-out D-Lib editorial, that's 80% of the total labor cost of journal publishing anyway. Admittedly, that's a bit of a red herring, because all the shouting is really over the other 20%; there are other eyebrow-raisers in that editorial, but let that go for now.) It would be a shame to lose that, and my sense is that online networking cannot presently replace it because of the low participation in online networking by academia generally (with exceptions, of course).

However, I also believe that any journal-publishing operation needs to operate responsibly. In the present environment, it is irresponsible not to use the Internet to reach the widest possible audience. (There are exceptions, but they are vanishingly few.) It is irresponsible to withhold uncompensated knowledge from emerging nations, from non-profit organizations, from practitioners, from governments, from anyone who could benefit from it but cannot pay out-of-pocket for it and does not (for whatever reason) have a proxy such as a library available. It is irresponsible to operate exclusively in the digital world without strong preservation plans in place. It is irresponsible to charge fees or to allow one's publishing partners to charge fees (no matter the business model; this goes for author-side fees as well as subscription charges) that wildly exceed the true out-of-pocket costs of publication.

A whacking lot of society publishers are flagrantly irresponsible by the above criteria. Should they be publishing? I'll say "no." Not until they can get their heads back on straight. If that means they fold, because they put all their revenue eggs in the subscription-journal basket—I'm not unhappy with that outcome. Whatever they did that is still necessary will resurface; that I believe.

Hope these are the kinds of answers you're interested in, John.
