Escaping Datageddon: comments, please

Aug 31 2010 Published by under Praxis

I'm due to give an introductory talk on data management to a group of graduate students later this fall. Since I like to steal from the best, I cribbed heavily from MIT's most excellent guide on the subject, particularly their slide deck, but I thought I could perhaps improve a bit on that deck's organization, as well as cut down on the information firehose without losing the main points.

I consider this still under construction, so feedback is most welcome.

Escaping Datageddon

  • (another) former academic says:

    Units! For the love of Jesus, Mary and Jehoshaphat! If you do nothing else, put the units in the variable names of your dataset!
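
    One way to make the units habit checkable is a small script that flags column names lacking a unit suffix. This is only a sketch: the `name_unit` naming convention, the `KNOWN_UNITS` and `UNITLESS` sets, and the sample header are all illustrative assumptions, not an established standard.

```python
import csv
import io

# Hypothetical convention: every measured column ends in "_<unit>".
# This unit list is an illustrative assumption, not a standard.
KNOWN_UNITS = {"m", "cm", "kg", "g", "s", "degC", "pct"}
# Identifier or label columns that carry no unit (also an assumption).
UNITLESS = {"site_id", "sample_id", "date"}


def columns_missing_units(header):
    """Return the column names that do not end in a known unit suffix."""
    missing = []
    for name in header:
        if name in UNITLESS:
            continue
        suffix = name.rsplit("_", 1)[-1]
        if "_" not in name or suffix not in KNOWN_UNITS:
            missing.append(name)
    return missing


# Made-up sample data: "weight" has no unit in its name.
sample = io.StringIO("site_id,depth_m,temp_degC,weight\nA1,2.5,11.3,40\n")
header = next(csv.reader(sample))
print(columns_missing_units(header))  # prints ['weight']
```

    Running something like this before archiving catches the column that will puzzle the next reader.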

    If you're looking for a fun exercise, have everyone bring a dataset that they think is ready for archiving to the workshop. Swap these around, and see if the new person can do some simple exploratory analysis and report back to the group on what the dataset is about and what it shows.

    You know the old joke about learning good manners from those who don't have any? The same goes for metadata. Only after you've beaten your head against its absence do you understand why it's essential.

    I realize your slide deck is a high-level talk targeted at people with scads of data to manage. For those with smaller projects, or less organized departments, allow me to suggest a few simple starting points.

    For Everyone: If you create an Excel file, make a tab called metadata. List your name, the date, and the origins and purpose of all the other tabs. That way, if you share the file with someone else and they share it further, people know who to talk to for clarifications.

    For PIs: Every thesis or dissertation should have an appendix at the back listing the raw data points. This is especially handy for undergraduate and master's theses, as they are unlikely to be published elsewhere.

    For Students: Every time you finish a project, export the final version of your data - the one you actually used in your report - to a delimited text file. Write a README file that explains what the data are, and put both in a folder called 'core data'. Voilà - your own personal data repository that you can move with you from one institution to the next. In my experience this turns out to be far smaller and more manageable than a hard drive called 'grad school stuff'.
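
    The student routine above could be scripted roughly as follows. This is a minimal sketch, assuming the final data is already in memory as a list of dictionaries; the function name `archive_final_data`, the file names `data.csv` and `README.txt`, and the README fields are hypothetical choices made for illustration.

```python
import csv
import tempfile
from datetime import date
from pathlib import Path


def archive_final_data(rows, fieldnames, project, description, root="core data"):
    """Write rows to <root>/<project>/data.csv with a README.txt beside it."""
    folder = Path(root) / project
    folder.mkdir(parents=True, exist_ok=True)

    # Plain comma-delimited text stays readable without any special software.
    data_path = folder / "data.csv"
    with data_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

    # The README records what the data are and when they were archived.
    readme_path = folder / "README.txt"
    readme_path.write_text(
        f"Project: {project}\n"
        f"Archived: {date.today().isoformat()}\n"
        f"Columns: {', '.join(fieldnames)}\n\n"
        f"{description}\n",
        encoding="utf-8",
    )
    return data_path, readme_path


# Demo in a temporary directory so nothing is left behind; the data are made up.
with tempfile.TemporaryDirectory() as tmp:
    data_file, readme_file = archive_final_data(
        rows=[{"site_id": "A1", "depth_m": 2.5}],
        fieldnames=["site_id", "depth_m"],
        project="masters-thesis",
        description="Final dataset used in the written report.",
        root=Path(tmp) / "core data",
    )
    print(readme_file.read_text(encoding="utf-8"))
```

    Giving each project its own subfolder is a design choice here; it keeps each README next to the single file it describes.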

  • (another) former academic says:

    If you will permit a second (and more helpful) thought ...

    The slide deck focuses on what to keep. But archiving is as much about what to throw away as what to keep. The naive response of many graduate students to slide 15 will be to just keep everything they create (data.xls, finaldata.xls, finisheddata.xls) in one giant folder. Selecting what's important and taking the time to write really good metadata is the hard part, and you may want to spend more time there.