Archive for: August, 2009

The dangers of intertwingularity

Aug 31 2009 Published under Praxis

When I was but a young digital preservationist, I was presented with an archival problem I couldn't solve.

This should not sound unusual. It happens a lot, for all sorts of reasons. If I can keep a few people from falling into traps that make digital preservationists throw up their hands in despair, I'm happy.

Anyway, the problem was a website with some interactions coded in JavaScript. If those interactions didn't work, the site made significantly less sense. (It could have been worse; even without the JavaScript, the materials on the site were still reachable.)

The JavaScript had been written before ECMA standardization. Some of it was obsolete, so obsolete it just didn't work any more in modern browsers. Neither did the site.

I am not a JavaScript programmer, so I had to turn down archiving the site. I wasn't happy about it, but sometimes life is like that.

It's always dangerous to intertwingle content and presentation. (That doesn't mean it's not sometimes necessary, of course… but necessity doesn't obviate the danger.) It's an order of magnitude more dangerous to intertwingle content, presentation, and behavior. Data outlasts code!

This has some implications for the data deluge. Consider, for example, the humble Excel spreadsheet, that common workhorse of data management. (Stop sneering, you statistics types with your fancy tools, and you database admins can hush too.) There's no behavior in an Excel spreadsheet, you may say; where's the problem?

Used a function anywhere in your spreadsheet? That's behavior, embedded right there inside your data where you least want it. Function definitions change among versions of Excel, and heaven help you if you move from Excel to Apple Numbers or OpenOffice Calc. Will your results still look the way they did when you first wrote the function? Who knows?

Built a chart or graph anywhere in your spreadsheet? Same problem, only more so.

On a slightly more abstract level, what's happening is that you're allowing your data analysis to rely on code that you didn't write, don't control, and can't document. This is obviously not ideal for long-term use of the data.
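
To make the intertwingling concrete, here is a toy sketch in Python (nothing to do with Excel's real formula engine; cells, values, and the lone "SUM" function are all invented) of a "spreadsheet" whose cells mix plain values with embedded formulas, and of the preservation-friendly move: snapshotting computed values alongside the formula text, so the numbers survive even if nothing can execute the behavior later.

```python
import csv
import io

# A toy "spreadsheet": each cell holds either a literal value or a
# formula (behavior) that depends on other cells. All values invented.
cells = {
    "A1": 10.0,
    "A2": 32.0,
    "A3": ("SUM", ["A1", "A2"]),  # behavior embedded right in the data
}

def evaluate(ref):
    """Resolve a cell to a plain number, executing any embedded formula."""
    v = cells[ref]
    if isinstance(v, tuple):
        op, args = v
        if op == "SUM":
            return sum(evaluate(a) for a in args)
        raise ValueError(f"unknown function {op!r}")
    return v

# Preservation-friendly export: keep the formula text for the record,
# but also snapshot the computed value so the number outlasts the code.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["cell", "formula", "value"])
for ref, v in cells.items():
    formula = f"={v[0]}({','.join(v[1])})" if isinstance(v, tuple) else ""
    writer.writerow([ref, formula, evaluate(ref)])
print(buf.getvalue())
```

The point of the export step is exactly the disentangling argued for above: the CSV carries the results as plain data, with the formula kept only as documentation.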

Disentangling behavior from data is very, very far from simple. Looking at this from the point of view of a would-be institutional-data librarian, I am flatly terrified by the variety of data that may come to my doorstep, and the concomitant explosion of behaviors that I may be expected to code and support.

I don't have an answer… but all of us who love data need to be asking these questions.

5 responses so far

If not now, when?

Aug 27 2009 Published under Tactics

I said a while ago that we don't know who's going to do data curation yet. I absolutely believe that.

I probably should have added, though, that we can have a pretty good idea who's not going to do it: anybody who isn't right this very minute planning to do it.

Make no mistake, there's money (from funders and institutions) and hard-won relevance to be had in this line of work. Quite a few people and organizations are eyeing it: IT, libraries, scholarly societies, journals, entrepreneurs.

If you want to get into the scrum, if you want a piece of the pie, better get your plan on now. This is no time for analysis paralysis. Research workflows have a lot of built-in inertia, so the first halfway-viable solution is extremely likely to win.

This doesn't mean you have to solve every problem in the universe. It does mean you need to look at the problem space, sort out what chunks of it you can solve, and stake your claim to them. Reports, strategic planning processes, elbow-rubbing in high places, whatever it takes.

I'm biased. I want libraries in on this game. But that means we—academic librarianship, I mean—we have got to get moving, because the data won't wait.

No responses yet

Last push for Louisville Free Public Library

Aug 27 2009 Published under Miscellanea

Steve Lawson and the LSW are three-fifths of the way to the goal of $5000 for the flood-ravaged Louisville Free Public Library by September 1.

The last two-fifths are the hard part. If you can help, please do.

Comment here or send me email (dorothea.salo at gmail) to let me know you've donated, and I'll do a random-number drawing for a PLoS travel mug and a size-large, never-worn PLoS One t-shirt.

Thanks.

2 responses so far

The humble index

Aug 25 2009 Published under Uncategorized

I'd like to start our tour of book and library information-management techniques with a glance at the humble back-of-book index. I started the USDA's excellent indexing course back in the day; it became clear fairly quickly that I don't have the chops to be a good indexer, and I never finished the course, but I surely learned to respect those who do. It's not an easy job.

Go find a book with an index and flip through it. Seriously, go ahead. I'll wait. Just bask in the lovely indentedness and order of it all.

Now answer me a question: Should Google be calling that huge mass of crawled web data it computes upon an index?

Arguably, it shouldn't, though this is absolutely a lost battle; the word "index" is polysemous and always will be. What Google has is more along the lines of a concordance of the web. What's a concordance, you ask? A list of words in a given corpus of text, along with pointers to where those words are used in the corpus. Way back in the day, compiling concordances to important literature (e.g. religious texts) was considered worthy scholarly work. Today, of course, feeding a text into a computer can yield a concordance in seconds—I'm no great shakes as a programmer, but even I could hack up some concordance software if I had to.

Google's index is a bit more than a straight-up concordance: they do stemming and some n-gram analysis and other fancy-pants tricks. But it is still qualitatively different from a back-of-book index. How? I'll adduce three major differences: human intervention, terminological insularity, and intentional grouping.

There is a standard documenting what an index is for and how to create one. I'm not paying over $250 to own it, but I'll happily give you the gist.

An indexer presented with a book reads it at least twice, with concentrated attention. She is looking for concepts that the book treats usefully and/or in some depth, because an index containing every passing mention of everything is usually useless to someone asking "does this book have useful, original information on topic X?"

(I did say "usually." Sometimes a topic is so terribly remote or abstruse that even the slightest mention is useful. That's when a concordance can be superior to an index. Google Books is a godsend to lovers of minutiae.)

Please note that I said concepts, not "words" or even "phrases." A recurring problem in information management is that human language is truly annoying about using different words for the same thing, in various sorts of ways that this post is already too long to discuss in depth. Suffice to say that part of the indexer's job is to tease out concepts in the text that aren't necessarily labeled consistently or even labeled at all. A text on web design may never actually use the word "usability," for example, but that doesn't mean it has nothing to say about the subject! A good indexer will work that out.

So how does an indexer label the concepts she finds? Well, ideally, the text has done that for her; that's why an index is more insular than Google, which makes considerable use of other people's labels for web pages insofar as those are discoverable through links. (That's what Googlebombing is all about, of course.) The indexer is not slavishly bound to the text's language, however. She is allowed to take into account the text's readers, and (what she believes to be) their language use.

An indexer will not lightly discard the text's usage. What she will do is use "See" entries to connect likely reader usage to the text's usage. If the aforementioned web-design text casually throws in "HCI" without ever expanding it (shame on the editor! but it does happen), a smart indexer will throw in an entry "Human-computer interaction. See HCI." Remember this trick. We will see it in other forms later.

A See entry is not the same as a See also entry. See entries are intended for more-or-less synonymous terms. Rather than wastefully repeat the entire litany of page numbers for every synonym of a given term, pick the likeliest term (probably the text's most-often-used term, but again, the indexer has some discretion) and point the other synonyms to it. See also entries are for related terms, other information that in the indexer's judgment a reader might be interested in.

See also entries are another example of the grouping function of an index, alongside the entire idea of bringing mentions of the same concept that have been scattered throughout the text together in a single entry. Google does not do this save by haphazard accident. A few other search engines try (have a look at Clusty), but the results typically don't make entire sense—and why should they? they're using algorithms to do a human's job!
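
The See/See-also machinery sketches naturally as a small data structure. In this hypothetical Python fragment, See entries are a synonym-to-preferred-term map and See also entries are related-term lists; the terms and page numbers are invented for illustration.

```python
# See entries map synonyms to the preferred term; See-also entries list
# related terms. Terms and page numbers are invented for illustration.
index = {
    "HCI": {"pages": [12, 45, 101], "see_also": ["usability"]},
    "usability": {"pages": [33, 34], "see_also": ["HCI"]},
}
see = {"Human-computer interaction": "HCI"}

def look_up(term):
    """Follow any See redirect, then return the entry's page numbers."""
    term = see.get(term, term)
    entry = index.get(term)
    return entry["pages"] if entry else []

print(look_up("Human-computer interaction"))  # redirected to the "HCI" entry
```

The data structure is trivial; deciding what goes in it is the part that takes a human reading the book twice.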

Purely mechanical questions such as page count enter into index compilation as well; publishers reserve a certain number of pages for the index (or in the hurly-burly of typesetting, a certain number of pages become available), and the index must be chopped to fit. You can imagine, I'm sure, that it's much harder to do a short index than a longer one!

Indexing electronic books introduces user-interface and workflow questions. The print book has the immensely convenient "page" construct to use as a pointer. The electronic book may have pages—or it may scroll, or the page boundaries may change according to font size, or… you see the UI problem, I trust. It's not insoluble, but it's annoying. The workflow problem is simple: how (and when in the production process) does the poor indexer mark the places a given entry should point to?

When I was doing ebooks back in the day, these problems hadn't been solved yet. I worry sometimes that if they remain unsolved, the noble art of book indexing will wither and die—and the search engine, as I hope you now understand, is not an entire replacement.

Go back and flip through that book index again. Appreciate it a little more? Excellent.

8 responses so far

Tidbits, 24 August 2009

Aug 24 2009 Published under Tidbits

Hello, Monday. My tidbits folder overfloweth.

Have a productive week!

2 responses so far

A little Friday metablogging

Aug 21 2009 Published under Metablogging

Well, I've been here for about a month now, and I've quite enjoyed myself! (And I finally did send in my contract, Erin. Really. I did.)

Thanks to all who have commented. (Well, except a spammer or two, but I got rid of them posthaste.) You're a civil, engaged, and smart bunch, and I appreciate you very much—especially when you keep me honest.

Please, if you will, introduce yourselves and tell me (and Trogool's other commenters) a bit about yourself in the comments to this post. Thanks!

3 responses so far

Let Them Eat Disk

Aug 20 2009 Published under Praxis

Many people, first confronted with the idea of data curation, think it's a storage problem. A commonly expressed notion is "give them enough disk and they'll be fine." Terabyte drives are cheap. Put one on the desk of every researcher, network it, and the problem evaporates, right?

Right?

Let me just ask a few questions about this approach.

  1. What happens when a drive on somebody's desk fails?
  2. What do we do about the astronomers, physicists, and climatologists, who can eat a whole terabyte before breakfast and hardly notice?
  3. What do we do about the social scientists, medical researchers, and others who (necessarily) collect personally-identifiable and/or confidential information and are ethically and often legally forbidden from exposing it to the world?
  4. How do we manage access to the drive on somebody's desk in the case of a collaboration across institutions? Who owns the data then? What about collaborations with industry, where trade-secret law may come into play?
  5. What do we do about certain varieties of lab science where the actual data-generating work is done by graduate students? Do they get terabyte drives too? (With my own two ears I've heard IT professionals say in all seriousness "They're just graduate students; we don't have to worry about their data.")
  6. What happens to somebody's drive and the stuff on it when she retires or moves between institutions?
  7. Is this system going to satisfy grant funders and journals who require data-sharing or data-sustainability plans?
  8. If we assume that the goal for at least some researchers is to make data available to the world at some juncture, who ensures that these drives and the material on them are discoverable (presumably via the Web)? Encoded adequately and sustainably, and in line with disciplinary data standards if any? Data-dictionaried, described, user-interfaced, and in a stable location with a stable identifier? (You can tell me "the researcher will!" I will laugh at you, but you can tell me that, sure.)
  9. The institution owns the drive. Does the institution own the data on it? If not, what can the institution realistically do to shepherd those data?
  10. What happens to somebody's drive when a patent, trade-secret, or copyright lawsuit is in play? ("You can't copyright data!" Hush, young padawan, and think of Europe.)
  11. Who's to say this drive gets used for research data instead of somebody's mp3 collection (or worse)? (Modify the question appropriately for music researchers, of course.)
  12. If the data need to go into a disciplinary or governmental repository of some kind, how does that happen?
  13. Who checks for and deals with bitrot, file-format migration, proprietary file formats, and similar hassles?

Whew. I'm afraid that's more than a few questions. Sorry about that. I hope my point is clear: data curation is a complicated problem! "Let Them Eat Disk" will not solve it; big disk is unquestionably necessary, but far from sufficient.
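
One of those questions, at least, is mechanizable. Bitrot checking usually means fixity checking: record a checksum at deposit, recompute it on each audit, and compare. A minimal Python sketch, using an invented temporary file as the "deposited" data:

```python
import hashlib
import os
import tempfile

def fixity(path):
    """SHA-256 of a file's bytes, for comparison against a stored manifest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# At deposit time, record a checksum; on each audit, recompute and compare.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"id,value\n1,10\n")
    path = f.name
manifest = {path: fixity(path)}

recheck = fixity(path)
os.unlink(path)
print("intact" if recheck == manifest[path] else "BITROT DETECTED")
```

Of course, detecting the rot is the easy half; someone still has to keep the manifest safe, run the audits, and repair from a good copy when a check fails, which is why "who checks?" is a staffing question and not a scripting one.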

7 responses so far

The classical librarian

Aug 19 2009 Published under Praxis

Five years ago (really? goodness, it hardly seems possible) I gave a preconference session at the Extreme Markup Languages conference (which is now Balisage) entitled "Classification, Cataloguing, and Categorization Systems: Past, Present, and Future."

I have learned to write better talk titles since then. However. The talk was actually a runthrough of library standards and practices for an audience of markup wonks. Like any field, librarianship has its share of jargon and history that legitimately seems impenetrable to outsiders.

I'm going to try to reprise some of that talk here in blog form, over time, on the belief that a few more folks understanding how library data operates cannot possibly be a bad thing.

So, then. We start with Robert Graves's I, Claudius, in which Livy says to Pollio

It seems, then, that we may as well abandon all hope of finding it, unless perhaps… why, there's Sulpicius! He'll know if anyone does. Good morning, Sulpicius. I want you to do a favour for Asinius Pollio and myself. There's a book we want to look at, a commentary by a Greek called Polemocles on Polybius's Military Tactics. I seem to remember coming across it here once, but the catalogue does not mention it and the librarians here are perfectly useless.

As you can see, librarians don't get no respect. We're used to it.

Let's look at the problem here. The librarians of the Apollo Library have built some sort of catalogue of their holdings. Graves doesn't tell us how he thinks they organized it—by author? author's ethnicity? language of work? title? if title, title of the commentary, or are commentaries organized under the work commented upon? In any case, however they organized it, Livy expected to find a particular item but didn't, and the librarians couldn't help him for some reason. Maybe the catalogue was incomplete?

Sulpicius's response is utterly delightful.

Sulpicius gnawed his beard for awhile and then said: "You've got the name wrong. Polemocrates was the name and he wasn't a Greek, in spite of his name, but a Jew. Fifteen years ago I remember seeing it on that top shelf, the fourth from the window, right at the back, and the title tag had just 'A Dissertation on Tactics' on it. Let me get it for you. I don't expect it's been moved since then."

So let's recap. Livy had the author and author's ethnicity (and possibly language) wrong, and he couldn't remember the title—but boy, he sure as heck expected the catalogue and the librarians to turn up his book—er, scroll—anyway!

Any reference librarian will tell you that this sort of reference request happens all the time. Graves absolutely nailed it with this anecdote. Respect the reference librarian! I surely do.

How do information-seekers and reference librarians solve such problems nowadays? Answers in the comments, and while you're at it, tell me how well you think the techniques work, and when and why they fail.

(Incidentally, if you're suddenly curious about ancient libraries, I recommend the short, breezy and fun treatment in Lionel Casson's Libraries in the Ancient World.)

One response so far

Please don't do this! A word about keywords

Aug 18 2009 Published under Praxis

I see a lot of metadata out there in the wild woolly world of repositories. Seriously, a lot. Thesis metadata, article metadata, learning-object metadata, image metadata, metadata about research data, lots of metadata.

And a lot of it is horrible. I'm sorry, it just is—and amateur metadata is, on the whole, worse than most. I clean up the metadata I have cleaning rights to as best I am able, but I am one person and the metadata ocean is frighteningly huge even in my tiny corner of the metadata universe.

So here's a bit of advice that would save me a lot of frustration and effort, and is likely to help the people who really need to read your stuff find it.

When you're doing keywords? Anything that shows up elsewhere in your record is not a keyword, okay?

Authors and other creators are not keywords, save for the rare case that the item is somehow autobiographical. Titles are not keywords. (Really. They're not. They may contain a keyword or two, but that's not the same thing.) Any search engine is going to turn up authors and titles that don't appear in the keyword field; trust me on this one. Likewise, if every single word in the full text of the item is a keyword, then nothing is.

The point of keywording is not to shovel in every single word that someone might conceivably search for. Leave that kind of indexing to Google and other full-text indexing engines. The point of keywording in this day and age is to distinguish this item from all the other items that look vaguely like it, to help folks who arrive there make the snappiest judgment possible about whether this item is what they need.

When you add keywords with a backhoe instead of an eyedropper, you are not raising the chance your item will be read or used. You are lowering it, because most people who arrive at the item will roll their eyes at the lengthy list of keywords and bounce right back out looking for something more targeted.

Keep your keywords to-the-point and as few as possible. This metadata librarian thanks you for it. So will your readers.
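
The advice above can even be partly automated. This hypothetical Python sketch drops any keyword whose every word already appears in the record's title or creator fields; the record layout and field names are invented for illustration.

```python
def prune_keywords(record):
    """Drop keywords whose every word already appears in the record's
    title or creator fields. Field names are invented for illustration."""
    elsewhere = set()
    for field in ("title", "creator"):
        elsewhere.update(w.lower() for w in record.get(field, "").split())
    return [k for k in record["keywords"]
            if not all(w.lower() in elsewhere for w in k.split())]

record = {
    "title": "Data curation in academic libraries",
    "creator": "Jane Scholar",
    "keywords": ["data curation", "Jane Scholar", "libraries",
                 "institutional repositories"],
}
print(prune_keywords(record))
```

Only the keyword that adds information the rest of the record lacks survives, which is exactly the eyedropper-not-backhoe standard.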

6 responses so far

The accidental informaticist

Aug 17 2009 Published under Tactics

The publisher Information Today runs a good and useful book series for librarians who find themselves with job duties they weren't expecting and don't feel prepared for. There's The Accidental Systems Librarian and The Accidental Library Marketer (that one's new) and a whole raft of other accidents.

I suspect "The Accidental Informaticist" would find an audience, and not just among librarians.

The long and short of it is, we just don't know who is going to do a lot of the e-research gruntwork at this point. Campus IT at major research institutions is seizing on the fun grid-computing work, and they're at least investigating collaboration solutions, but at least some of them seem to be balking pretty hard at providing the big disk necessary for data curation, never mind the human resources necessary to do data curation anything like right.

Having campus IT handle these services can also create a tremendous gap between haves and have-nots. Grant-funded science can pay into cost-recovery operations, which many campus IT shops are. Grant-funded science can hire its own IT if it has to, as well as dedicated informaticists (though admittedly they mostly don't). Anybody who isn't grant-funded science and has data? Is out in the cold.

It's worth noting as well that even grant-funded science doesn't often think past grant expiration. For collaboration tools and grid computing resources, that's fine. For data curation? Not so fine.

Alma Swan, in a report well worth reading, posits four kinds of data-curation staff: data creators, data managers, data librarians, and data scientists. I'm not sure how far I can go with that. I agree with the skillsets as Swan lays them out; I'm just agog at the idea that any institution or research shop will be able to divvy up these tasks among four whole people!

It doesn't help Swan's case in my mind that I myself am half a data librarian and half a data manager. (Swan says that "the boundaries are fuzzy," but I'm not sure there are any boundaries at all!) Munging data to make it flow from one place to another? Been doing that these ten years. Looking after digital data as best I can to keep it usable for the long term? Sure. That's what an institutional-repository manager is for, right?

What I'm afraid of is that Swan has reified the job descriptions too soon, and that eager institutions will say to themselves "this! this is what we need!" before they do the hard work of making internal decisions about which pieces of the e-research puzzle they can and should assemble.

Make no mistake, it's okay not to do everything. If you're going to focus on big science and leave everybody else gasping, that's your choice; it may even be the only choice that pencils out, budget-wise. Personally, I'm of the opinion that big science doesn't need anybody's sloppy help for the most part, and the interesting problems are found elsewhere, but that's me.

The point is, I don't know and neither do you—and neither does Alma. It's too soon.

So how do we confront a big nebulous problem? By opening the door to happy accidents, I think. The consulting model for e-research in place at institutions like Purdue makes a lot more sense to me than some other approaches I've seen. You can do endless surveys or focus groups to figure out what institutional needs are, but there's nothing quite like throwing spaghetti at the wall and seeing what sticks.

Sure, it's a risky strategy. It has the virtue of being cheap, however, compared to buying a lot of big iron before the local need is established. It also assumes that you have the right kind of people on staff—can-do souls comfortable with a lot of uncertainty and able to learn fast. Staffing a speculative consulting operation with the change-averse is a fast road to oblivion.

We're all accidental informaticists. We'll all have to learn by doing. That's okay. Let's do it!

No responses yet
