John Unsworth, Illinois. "Idiosyncrasy at scale: digital curation and the digital humanities."
Can't remove ambiguity in the humanities (the way you can in chemistry)! We'd remove everything that matters. This can make it hard to talk about humanities "data" (is there a thermometer for the zeitgeist?). Humanities data are idiosyncratic because the people who make them are.
Research methods are changing as traditional objects of humanities study (e.g. diaries, correspondence) become born-digital. Still have to "tame the mess," recognize that mess has value, including as a mess. Is departure from the norm an "error" or a "data point"?
"Retrieval is the precondition for use; normalization is the precondition for retrieval." (Not sure I agree with this! Techniques exist to deal with messiness.)
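To make the objection concrete: fuzzy matching is one such technique, retrieving variant spellings from a messy corpus without normalizing it first. This is a toy sketch (the vocabulary and cutoff are mine, not from the talk), using Python's standard-library `difflib`:

```python
import difflib

# A messy vocabulary with early modern spelling variants, un-normalized.
vocabulary = ["loue", "love", "lov'd", "beloued", "glove"]

def fuzzy_lookup(query, words, cutoff=0.75):
    """Retrieve close matches for a query without normalizing the corpus."""
    return difflib.get_close_matches(query, words, n=5, cutoff=cutoff)

print(fuzzy_lookup("love", vocabulary))  # finds "loue" without any spelling cleanup
```

The cutoff trades recall against noise: lower it and you also catch `lov'd`, at the cost of false hits.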
Six laws to give us pause:
- Scholars interested in particular texts.
- Tools are only useful if they can be applied to texts of interest.
- No one collection has all texts.
- No two collections are format-identical.
Therefore: humanities data narratives include normalization (of "Frankendata": broadly aggregated but imperfectly normalized data). Lots of different kinds of normalization (spelling, punctuation, chunking, markup, metadata).
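A minimal sketch of what the character/spelling/punctuation kinds of normalization look like in practice. The rules here are illustrative inventions typical of early modern printed texts (the sort of thing EEBO transcriptions contain); a real pipeline like MONK's is far more elaborate:

```python
import re
import unicodedata

# Toy spelling table; real projects use large rule sets and dictionaries.
SPELLING = {"loue": "love", "haue": "have", "vpon": "upon"}

def normalize(text: str) -> str:
    """Illustrative normalization: characters, then spelling, then whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = text.replace("\u017f", "s")   # long s (ſ) -> s
    text = text.replace("vv", "w")       # double-v -> w
    # Spelling normalization via lookup, token by token (naive about case).
    tokens = re.findall(r"\w+|\W+", text)
    text = "".join(SPELLING.get(t.lower(), t) for t in tokens)
    text = text.replace("\u2019", "'")   # unify curly apostrophes
    return re.sub(r"\s+", " ", text).strip()

# normalize("I haue ſeene loue")  ->  "I have seene love"
```

Note that every rule is a lossy interpretive choice ("seene" is left alone only because it isn't in the table), which is exactly why the departure-from-the-norm question above matters.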
Example: MONK project, using EEBO and ECCO within the CIC. (Me, on soapbox: This. THIS. is the collateral damage from "sustainability" initiatives that impose firewalls around content. If you're not in the CIC, too bad so sad, you can't use these data.) Lots of data-munging which I won't recount.
Example: Hathi Trust, now available through an API. Will be a central player in developing research uses for digitized texts. Preprocessing/normalization blows up the storage space needed by 100x. There will be a research center established for working with this corpus.
Can we crowdsource corrections, à la GalaxyZoo? People are interested and willing, it can't be automated, and we need the help.
How do I keep my solution from becoming your problem? The Association for Computers and the Humanities is trying to crowdsource some best-practices recommendations for humanities researchers on managing their digital/digitized collections. Immediate conflict on the DHAnswers site: to use markup or not to use markup? Practical upshot: when do we have usefully shareable data? When should we stop messing with it so others can use it? What's data and what's data interpretation, and what do we do when they coexist in the same marked-up text?
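The data-vs-interpretation coexistence problem can be made concrete with a hypothetical TEI-flavored fragment (the snippet and element choices are mine, not from the talk): the transcription is the data, while the `<reg>` regularization and its `resp` attribute are an editor's interpretation layered into the same file. Whoever reuses the text has to choose a reading:

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI-style fragment: <orig> is the transcribed data,
# <reg resp="#editor1"> is one editor's interpretive regularization.
SNIPPET = """
<l>
  <choice>
    <orig>loue</orig>
    <reg resp="#editor1">love</reg>
  </choice>
  is not loue
</l>
"""

def reading(elem, prefer="orig"):
    """Flatten markup to plain text, picking original or regularized forms."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "choice":
            pick = child.find(prefer)
            parts.append((pick.text or "") if pick is not None else "")
        else:
            parts.append(reading(child, prefer))
        parts.append(child.tail or "")
    return "".join(parts)

root = ET.fromstring(SNIPPET)
print(" ".join(reading(root, "orig").split()))  # diplomatic: "loue is not loue"
print(" ".join(reading(root, "reg").split()))   # regularized: "love is not loue"
```

The second output shows the trap: the regularization was only applied where an editor bothered to mark it, so a "normalized" extraction is still partly un-normalized.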
Humanities data is bigger than books! Books are the tip of the iceberg. NARA strategy for digitizing archival materials: they have 5x the pages of what's in Hathi Trust, in much less tractable forms than the books Google/Hathi is working on. And that's just one archive! We'll have to learn how to manage this kind of scale.