Good science relies on good record-keeping. Most often, in psychology, record-keeping comes in one of two forms:

1. General lab or field notes.
2. A formatted dataset.

Both are crucial. This document suggests some conventions for dataset creation and use based on the most popular approaches in data science. It’s also based on my own preferences and experience and is unapologetically opinionated. It is my contention that following this guide will generally lead to datasets that are easily understood by other scientists, including collaborators.

## Naming variables

1. Never put spaces or special characters in variable names. Different statistical packages deal with variable names differently. Some cannot handle such characters (e.g. SPSS hates spaces), and you’ll make hours of work for yourself if you ever need to switch to a different package. There are many options for formatting variable names, but consistency is key. The point of the formatting is to separate each part of the naming hierarchy so that the name is easy to read. Most users of R, and many users of SPSS, prefer the underscore ( _ ) as a separator. (We don’t advocate dot separation because dots mean something in Python, and occasionally Python can do something that R and SPSS can’t. Renaming your variables because you need to run a single line of Python code doesn’t lead to a fulfilling life.) Stick to lowercase. Always. Some languages and apps are case sensitive, and wondering why you can’t find your Total_Score variable is very irritating.
2. If you have a group of related variables, put the highest element of the hierarchy at the start of the name, then the next highest, and so on. For instance, if we had items from the Hospital Anxiety and Depression Scale (HADS), we would name them hads_q1, hads_q2, ..., hads_qn. The ‘q’ is helpful if you later want to list just the variables that relate to individual questions: you can search for hads_q* to get a list of all questions. Total scores for depression and anxiety might be hads_dep_tot and hads_anx_tot. Three-letter codes are generally sufficient to explain each element in a name, but if a measure already has a four- or five-letter acronym, use it. Other people will be looking at your dataset at some point, so if there’s already an established convention, stick to it to make their lives easier.
3. If you have data at multiple time points, put the time point first. This follows the same logic as above. The item is nested within the measure, and the measure is nested within the whole questionnaire administered at a given time, so the time point is highest in the hierarchy. Hence, t1_hads_q1 and not hads_t1_q1.
4. If items need reverse scoring or rescaling, never do this in the original variable. To see why, consider reverse-scoring a Likert-type scale with response options 1 to 4:
   - 1 → 4
   - 2 → 3
   - 3 → 2
   - 4 → 1

   Now imagine you accidentally run the same function again. Your fours become ones, then they become fours again, and your variable retains no record of the fact that you screwed up. Write the transformed values to a new variable instead, so the original is always there to check against. Reverse-scored variables should be named with r or rev on the end, and rescaled variables should typically have s or scaled on the end. Z-score transformed variables are usefully identified if they end in z.
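As a rough illustration of rules 1 to 3, here is a minimal Python sketch that tidies messy column labels into lowercase, underscore-separated names and then lists the question items with a wildcard. The helper names `to_snake_case` and `select_columns` and the example labels are made up for this illustration; they are not part of any package, and the cleaning rules are only one reasonable reading of the conventions above.

```python
import fnmatch
import re


def to_snake_case(name: str) -> str:
    """Convert an arbitrary column label to lowercase, underscore-separated form."""
    # Insert an underscore at camelCase boundaries: "totalScore" -> "total_Score"
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name.strip())
    # Replace each run of characters other than letters and digits
    # (spaces, dots, hyphens, ...) with a single underscore
    name = re.sub(r"[^A-Za-z0-9]+", "_", name)
    # Drop leading/trailing underscores and force lowercase
    return name.strip("_").lower()


def select_columns(columns, pattern):
    """Return the names matching a shell-style wildcard such as 't1_hads_q*'."""
    # fnmatchcase is case sensitive on every platform
    return [c for c in columns if fnmatch.fnmatchcase(c, pattern)]


# Hypothetical labels as they might arrive from a survey export
raw = ["ID", "T1 HADS Q1", "T1 HADS Q2", "T1 HADS Dep Tot"]
cleaned = [to_snake_case(label) for label in raw]

print(cleaned)
# ['id', 't1_hads_q1', 't1_hads_q2', 't1_hads_dep_tot']
print(select_columns(cleaned, "t1_hads_q*"))
# ['t1_hads_q1', 't1_hads_q2']
```

Because the time point and the measure come first in every name, a single wildcard picks out all the items for one measure at one time point, and the total score is left out because it lacks the `q` element.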
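The point of rule 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dataset is represented as a list of dictionaries (one per participant), the functions `reverse_score` and `add_reversed` are hypothetical helpers, the values are invented, and the `_r` suffix combines the "r on the end" advice with the underscore separator from rule 1.

```python
def reverse_score(value, low=1, high=4):
    """Reverse one Likert response on a low..high scale; missing (None) stays missing."""
    if value is None:
        return None
    if not low <= value <= high:
        raise ValueError(f"{value!r} is outside the {low}-{high} response range")
    # On a 1-4 scale: 1 -> 4, 2 -> 3, 3 -> 2, 4 -> 1
    return low + high - value


def add_reversed(rows, source, low=1, high=4):
    """Store reversed scores in '<source>_r' and leave the source variable untouched."""
    target = f"{source}_r"
    for row in rows:
        row[target] = reverse_score(row[source], low, high)
    return target


# Invented example data: three participants, one item
rows = [{"t1_hads_q2": 1}, {"t1_hads_q2": 4}, {"t1_hads_q2": None}]

add_reversed(rows, "t1_hads_q2")
add_reversed(rows, "t1_hads_q2")  # an accidental second run does no harm

print(rows[0])
# {'t1_hads_q2': 1, 't1_hads_q2_r': 4}
```

Because the reversed values are always computed from the untouched original, running the step twice gives the same result as running it once, and the original responses remain available for checking.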