Formatting datasets

Good science relies on good record-keeping. Most often, in psychology, record-keeping comes in one of two forms:

  1. General lab or field notes.
  2. A formatted dataset.

Both are crucial. This document suggests some conventions for dataset creation and use based on the most popular approaches in data science. It’s also based on my own preferences and experience and is unapologetically opinionated. It is my contention that following this guide will generally lead to datasets that are easily understood by other scientists, including collaborators.

Naming variables

  1. Never put spaces or special characters in variable names. Different statistical packages deal with variable names differently. Some cannot handle such characters (e.g. SPSS hates spaces) and you’ll make hours of work for yourself if you ever need a different package. There are many options for formatting variable names, but consistency is key. The point of the formatting is to separate each part of the naming hierarchy for clear readability. Most users of R, and many users of SPSS, prefer the underscore ( _ ) as a separator. (I don’t advocate dot separation because dots mean something in Python, and occasionally Python can do something that R and SPSS can’t. Renaming your variables because you need to run a single line of Python code doesn’t lead to a fulfilling life.) Stick to lowercase. Always. Some languages and apps are case-sensitive, and wondering why you can’t find your Total_Score variable is very irritating.
  2. If you have a group of related variables, put the highest element of the hierarchy at the start of the name, and so on. For instance, if we had items on the Hospital Anxiety and Depression Scale (HADS), we would name them hads_q1, hads_q2, ..., hads_qn. The ‘q’ is helpful if you later want to list just the variables that relate to individual questions. You can search for hads_q* to get a list of all questions. Total scores for depression and anxiety might be hads_dep_tot and hads_anx_tot. Three-letter codes are generally sufficient to explain each element in a name, but if a measure already has a four- or five-letter acronym, use it. Other people will be looking at your dataset at some point, so if there’s already an established convention, stick to it to make their lives easier.
  3. If you have data at multiple time points, put the time point first. This follows the same logic as above. The item is nested within the measure, and the measure is nested within the whole questionnaire administered at a given time. Time point is therefore highest in the hierarchy, so use t1_hads_q1 and not hads_t1_q1.
  4. If items need reverse scoring or rescaling, never do this in the original variable. To see why, consider reverse-scoring a Likert-type scale with response options 1 to 4:
    1 → 4
    2 → 3
    3 → 2
    4 → 1
    Now imagine you accidentally run the same function again. Your fours become ones, then they become fours again. And your variable retains no record of the fact that you screwed up. Reverse scores should be named with r or rev on the end, and rescaled variables should typically have s or scaled on the end. Z-score transformed variables are usefully identified if they end in z.
  5. Provide meaningful labels to your variables/items. hads_q1 is relatively meaningless unless you are well versed in all items of the HADS. When running certain tests such as exploratory factor analysis, you can ask your software to indicate the label related to the item. This is very helpful when trying to determine what the underlying construct behind your factors might be, without having to recheck your list of items. You may also return to the dataset after having forgotten all about it. Providing Future You with a nice label will mean you won’t have to flick through very old (perhaps incomplete) notes to make heads or tails of the data file.
  6. Consistency of variable names across studies will save you countless hours. This is especially true when you code (in SPSS syntax, or R, or whatever) your variable computations or data wrangling. Even if you don’t think you will, you might need to do this. Researchers often end up using the same piece of experimental software, or the same psychometric measure, more than once. If your names are consistent, you won’t need to write new syntax for totals, means, Cronbach’s alphas, subscale scores, and so on: the next time you use that measure or software, you’ll already have a nice script that does all these things.
  7. Always have a variable that records participant number. One click of the ‘sort’ button and your rows will all move around. You’ll no longer know who’s who.
  8. Whenever practical, have a variable that records the date of the observation.
  9. If more than one person is collecting data, have a categorical variable that records which experimenter did the observation.
  10. Include a string (plain text) variable where you can record any general (perhaps unexpected) notes relating to the particular observation. “Participant had forgotten to turn off mobile phone. Call came through during testing,” might be a cast-iron reason for excluding the participant’s data later on.
  11. If you’re using both paper and electronic records, make sure there’s a way to match up your paper and electronic data per participant, in case anything gets entered incorrectly or gets corrupted. Usually, a simple participant number will suffice, unless there are ethical constraints.
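
A few of the naming rules above are easier to see in code. What follows is only a rough sketch in plain Python, not a recommended pipeline: the helper names clean_name and reverse_score, the example column headers, and the choice of an _r suffix are all my own assumptions for illustration. It tidies messy headers into lowercase underscore-separated names, selects items by their hierarchical prefix, and reverse-scores into a new variable rather than overwriting the original.

```python
import re


def clean_name(raw: str) -> str:
    """Turn an arbitrary column header into a lowercase, underscore-separated name."""
    name = raw.strip().lower()
    # Collapse every run of characters that is not a-z or 0-9
    # (spaces, dots, hyphens, and so on) into a single underscore.
    name = re.sub(r"[^a-z0-9]+", "_", name)
    return name.strip("_")


def reverse_score(value: int, low: int = 1, high: int = 4) -> int:
    """Reverse a Likert response on a scale running from low to high."""
    return low + high - value


# Hypothetical raw headers, as they might arrive from a survey export.
raw_row = {
    "Participant ID": 7,
    "T1 HADS.Q1": 4,
    "T1 HADS.Q2": 1,
    "T1 HADS Dep Tot": 9,
}

# Rename every variable once, up front.
row = {clean_name(key): value for key, value in raw_row.items()}

# Hierarchical names make prefix-based selection easy (the hads_q* idea).
hads_items = sorted(name for name in row if name.startswith("t1_hads_q"))

# Reverse-score into a NEW variable (assumed suffix: _r). The original is
# never modified, so running this line twice gives the same answer.
row["t1_hads_q1_r"] = reverse_score(row["t1_hads_q1"])
row["t1_hads_q1_r"] = reverse_score(row["t1_hads_q1"])  # accidental re-run is harmless

print(hads_items)
print(row["t1_hads_q1"], row["t1_hads_q1_r"])
```

Because the reversed value is always computed from the untouched original, an accidental second run cannot flip your scores back without you noticing. Note that this clean_name sketch also replaces accented letters with underscores, which may or may not be what you want.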

Other stuff

If you use R, you could do worse than to read Hadley Wickham’s style guide: