In many of the data-curation talks and discussions I've attended, a distinction has been drawn between Big Science and small science, the latter sometimes being lumped with humanities research. I'm not sure this distinction completely holds up in practice—are the quantitative social sciences Big or small? what about medicine?—but there's definitely food for thought there.
Big Science produces big, basically homogeneous data from single research projects, on the order of terabytes in short timeframes. For Big Science, building enough reliable storage is a big deal; it's hard to even look at the rest of the problem until the storage piece is solved. Some in the data curation space focus unabashedly and exclusively on Big Science; Lee Dirks's well-constructed and lucid talk at Harvard yesterday hinted that he is one of these. Standards for data tend to grow fairly quickly in Big Science environments, both de facto (because there's only one source for the data!) and de jure (as in astronomy, which is a fascinating story I'm not quite competent to tell).
Big Science also has big money. It can't be done at all otherwise. The corollary to big money is big teams of researchers and allies.
Small science is what those of us who work at colleges and universities are more accustomed to. Grants are small if they exist at all; research is generally a solo or single-lab endeavor. Research procedures are often ad hoc, invented by the researcher like Minerva springing from the head of Jove. Data standards do not exist; as often as not, there isn't a critical mass of people doing similar enough work and willing enough to share data to come together to create a data standard.
It has been asserted that small science, taken as a whole, is likely to create more research data than Big Science. When I tracked this assertion toward its source some time ago, the source turned out to be an otherwise-unsupported statement in the Chronicle of Higher Education (can't link; article behind paywall). So I give you this assertion despite not having any proof for it other than intuition. It is intuitive: Big Science accounts for few researchers owing to its expense; small science is a horde, comparatively. Many small datasets add up startlingly fast, partly because storage for each one is less of an immediate issue, partly because the fundedness or Bigness of a science is not necessarily a good measure of its data requirements. (Any research creating high-def digital video in quantity right now is stuck in just as nasty a storage problem as Big Science.)
When I look at business models and processes for data curation, honestly, storage is the least interesting aspect of the problem to me. Partly this is privilege talking: where I work, the intricacies of digital storage are Somebody Else's Problem. All I have to do is find stuff to fill it up! Partly it's consciousness that this problem is absolutely being actively worked on; watch Dirks's presentation for examples. I have faith that the storage problem will be decently managed.
Mostly, though, it's that I'm a librarian, not a sysadmin. The problems that interest me about data are the description, discovery, format, interoperability, and human problems. And I can see a serious, scary human problem lurking under the Big-versus-small science question.
I'm going to hold it as axiomatic that on some level, all of the data arising from the research enterprise are equal in importance, at least potentially. We can't know a priori which researcher studying which phenomenon in which institution will produce data that make possible a startling insight. We triply can't know this a priori because of aftermarket (so to speak) data mashups. The original experiment may have been a bust, or the original observation apparently uninteresting, but just combine those data with other data and watch them fly!
It does not seem, though, that under the data regimes emerging, all data will receive equal care. Even within our own institutions, them that has the gold will make the rules, as "cost recovery" becomes the order of the day. Big Science has the gold. Small science doesn't, and neither do the humanities.
I wonder whether cost-recovery institutional cyberinfrastructure will manage to survive, honestly. (I hasten to say I don't know that it will fail, but I have misgivings.) Big Science has a history of funding and managing its own research-related services, even to running its own libraries. Why would data curation be the exception? Arguably it should be because of the long-term, past-grant-expiration sustainability requirement, but I don't think that argument has ever stopped Big Science before. So where are cost-recovery ops going to recoup their costs? Small science can't pay. And how is cost recovery a viable business model for data that have to survive lean grant times, anyway?
There's a scale problem involved, too. Because Big Science creates lots of basically homogeneous data, once you're past the storage problem, the other problems are fairly efficient to solve. Once you've sorted out how to describe Big Science data, the procedures can be institutionalized, solved en masse over the entire project. Set it and forget it. Human-resource cost per terabyte of data: minimal, even absurdly small.
Small science, by comparison, creates lots of little pieces of highly heterogeneous data. Without standards, each piece will need individual attention if it is to be adequately described and future-proofed. Human-resource cost per terabyte of data: frightening. Certainly, some of these data will be relatively simple to cope with, and I do expect standards and practices to improve generally; it won't always be necessary to explain the idea of metadata to people. Even so, this is high-touch, high-expense work, even when the actual storage requirements are minimal!
Where is the money to come from? I don't know. Until we all interrogate some of the assumptions underlying our business models, however, we won't be able to advance equitable solutions to the data-curation problem.