Choosing a statistical test: A cheat sheet

Students who are new to statistics tend to find it tricky to remember which test to use under which circumstances. The following diagram is intended as a decision aid only. There are many more statistical tests that are not shown here, but these are the basic ones most commonly taught on psychology courses.

Even if you end up finding out that you need a more complex or niche statistical test, an aide memoire like this can still be useful if it reminds you of the relevant keywords to search for.

Remember, always check the assumptions of the test you choose.

Alternatives to pro-rating for missing data

Behavioural scientists often collect data using multi-item measures of latent constructs. For instance, clinical psychologists measure anxiety and depression using self-report questionnaires composed of multiple psychometric items, whilst child psychologists measure developmental progress by asking parents batteries of questions. Missing data are extremely common on such questionnaires, and one usually finds that data are missing at the item level. In other words, participants miss out items, either by accident, or because they don’t want to answer certain questions. Thus, researchers are left with some data relating to a given construct for a particular participant, but not complete data.

The usual solution is pro-rating, where the mean of the completed items for a given scale (or sub-scale) is taken as the imputed value for any missing items on that scale. If you’re not familiar with the practice, I’ve written a post about it. As Mazza, Enders, and Ruehlman (2015) point out, this pro-rating practice is common across most of the behavioural sciences, and especially in psychology. But, as they and others also discuss, there are potentially serious issues with the practice.

Pro-rating makes a number of assumptions about the structure of the data. To my mind, for pro-rating to be an ideal solution, one needs to make the following assumptions:

  1. The domain coverage of the completed items overlaps sufficiently with the domain coverage of the missing items.
    Some scales explicitly sample behaviours from a variety of related domains and therefore breach this assumption. The most obvious examples are scales for mental health conditions which map items to diagnostic criteria. Answering a question about whether you struggle to sleep is a relatively poor stand-in for an item about whether you have lost the ability to enjoy things you once did.
  2. The scale has good internal consistency.
    The basic idea behind pro-rating is that within a scale, items are so highly correlated with each other that they can to some degree stand in for one another. The basic idea therefore falls apart if the scale has poor internal consistency (i.e. poor inter-item correlations).
  3. A high proportion of the items are complete and can be used to calculate the scale score.
    Estimating someone’s IQ from 90% of the items in an intelligence test is one thing, but giving them only 5% of the items and then pro-rating the rest would clearly be inferior. Nobody seems to agree on what the cut-off should be, but pro-rating 20% missing data seems to be routine (Mazza, Enders, and Ruehlman, 2015). Graham (2009) suggests proration might be considered more reasonable the more items are used, and that we should never use fewer than half (!) the scale items.
  4. The mean scores on the missing and completed items are similar.
    Put another way, average item difficulty is reasonably well matched across the missing and completed items. See Enders (2010).
  5. Factor loadings or item-total correlations are similar across missing and non-missing items.
    One way to think about this is that we would be introducing more error into the measurement if the missing items were those with the best loadings on the factor. Again, see Graham (2009).

Reading the above list, anyone familiar with a handful of psychometric scales will readily see that these are not always reasonable assumptions. But you will also see that it’s fairly straightforward to check these assumptions.
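To give a flavour of how little work some of these checks involve, here is a minimal sketch in Python (not SPSS), using only the standard library. The data are made up, and the names (responses, proportion_missing, cronbach_alpha) are my own illustrative choices rather than anything from a particular package. It covers only three of the checks: the proportion of missing items per participant, internal consistency on complete cases, and the mean of the valid answers for each item.

```python
from statistics import mean, variance

# Hypothetical toy data: rows are participants, columns are items.
# None marks a missing answer.
responses = [
    [1, 2, 1, 3],
    [2, 2, 3, None],
    [3, 4, 3, 4],
    [4, 3, 5, 4],
    [5, 5, 4, 5],
    [2, None, None, None],
]

def proportion_missing(row):
    """Share of items this participant left blank (assumption 3)."""
    return sum(v is None for v in row) / len(row)

def cronbach_alpha(rows):
    """Cronbach's alpha for complete rows only (assumption 2)."""
    k = len(rows[0])
    item_variances = [variance(column) for column in zip(*rows)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)

# Assumption 3: how much is missing for each participant?
missing = [proportion_missing(row) for row in responses]

# Assumption 2: internal consistency, calculated without any pro-rating.
complete_cases = [row for row in responses if None not in row]
alpha_complete_cases = cronbach_alpha(complete_cases)

# Assumption 4: mean of the valid answers on each item, for comparing
# the items that tend to be skipped with those that tend to be answered.
item_means = [
    mean(v for v in column if v is not None) for column in zip(*responses)
]
```

In this toy dataset the last participant has 75% of their items missing, which is far beyond any cut-off discussed above, so they would be flagged before any pro-rating took place.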

Pro-rating also causes two notable side-effects, of which researchers should be mindful:

  1. The definition of your scale now varies across participants. It no longer has k items. It has a different number of items relating to the amount of missing data for each participant. If your data are not missing at random (because, say, someone who scores high on a certain construct is unwilling to answer some of the questions relating to that construct) you may have extra problems to worry about.
  2. Pro-rated data will artificially inflate estimates of internal consistency. This is because you have just created values (for participants with missing data) as a linear composite of other values for that participant. For this reason, any estimate of internal consistency should be calculated without pro-rating.
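The second side-effect is easy to see with a small made-up example. The sketch below is in Python rather than SPSS, the data are invented, and the function names are hypothetical. It calculates Cronbach's alpha on a complete toy dataset, then blanks out two answers, fills them with each participant's own mean, and calculates alpha again. One toy example proves nothing in general, but it does show the direction of the problem.

```python
from statistics import variance

def cronbach_alpha(rows):
    """Cronbach's alpha for complete data (rows = participants)."""
    k = len(rows[0])
    item_variances = [variance(column) for column in zip(*rows)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)

def fill_with_person_mean(row):
    """Pro-rate: replace each missing value with the participant's own mean."""
    valid = [v for v in row if v is not None]
    person_mean = sum(valid) / len(valid)
    return [person_mean if v is None else v for v in row]

# Made-up complete data: five participants, four items.
full = [
    [1, 2, 1, 3],
    [2, 2, 3, 1],
    [3, 4, 3, 4],
    [4, 3, 5, 4],
    [5, 5, 4, 5],
]

# The same data, with the last item blanked for the first two participants.
with_gaps = [
    [1, 2, 1, None],
    [2, 2, 3, None],
    [3, 4, 3, 4],
    [4, 3, 5, 4],
    [5, 5, 4, 5],
]

filled = [fill_with_person_mean(row) for row in with_gaps]

alpha_full = cronbach_alpha(full)      # about 0.90
alpha_filled = cronbach_alpha(filled)  # about 0.95
```

The filled-in values are built from each participant's other answers, so they agree with those answers more closely than the real ones did, and alpha goes up.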

A good many methodologists have therefore suggested that researchers should prefer other methods of dealing with missing data. These other methods are now generally possible in a range of statistical packages, but they are more computationally complex.

More importantly, at the time of going to press, they’re generally not available in the software psychologists use most. SPSS and Jamovi are the two most used statistical apps in psychology. SPSS does include multiple imputation, but Jamovi does not, and neither of them includes Full Information Maximum Likelihood options.

I am not a statistician, and nor do I play one on the internet. But I am a working psychological scientist, and in my humble opinion, it is acceptable to make the commonplace and pragmatic decision to use prorating, so long as the above assumptions are checked, and the above warnings are taken into account. It goes without saying that these things should also be addressed in our reporting.


Enders, C. K. (2010). Applied missing data analysis. New York, NY: Guilford Press.

Mazza, G. L., Enders, C. K., & Ruehlman, L. S. (2015). Addressing Item-Level Missing Data: A Comparison of Proration and Full Information Maximum Likelihood Estimation. Multivariate Behavioral Research, 50(5), 504-519. doi:10.1080/00273171.2015.1068157

Computing variables and pro-rating in SPSS

The basics

SPSS allows you to compute new variables, based on existing ones. This is really useful if, for instance, you want to create a total score for a psychometric scale or other questionnaire.

⚠️ Be sure to read this post on assumption checks you should perform.

You access the Compute Variable dialogue box from the Transform menu…

Here’s what each part of the window is for…

Now let’s say that you want to simply add up three variables to make a total score. You just need to type a sensible name in the left-hand column, and then place each of the variables in the right-hand ‘Numeric Expression’ box, with a plus between them. (Just like in normal maths, the + tells SPSS to add them up.)

You might find that the SPSS Output window pops up, showing you the code it ran in the background when you clicked ‘OK’ but you can ignore that. Go back to your dataset. At the end, on the right, you’ll find a new column, with the name you just created, and your calculation carried out.

Pro-rating for missing data

This works great for many things, but it does have one drawback. What if you have missing data? For instance, often when we run a survey study, some participants don’t want to answer certain questions, or they are having a careless moment and just skip over one or two questions.

If you have a missing value amongst the values you’re trying to add up, SPSS will refuse to add them up, because, after all, it wouldn’t be a true answer for any participant who has a missing value.

Why does SPSS do this? Well, imagine that we’re adding up 10 questions on a depression scale and that each question might get an answer from 0 (happy) to 10 (depressed). What if Jo Bloggs answers a 10 for half the questions, but then fails to answer the rest? As a human being looking at these scores we might think that Jo is very depressed (maximum score on the ones they did answer) and perhaps even so distressed they couldn’t bring themselves to answer the rest of the questions. If you just added up a total score, though, Jo would get 50 out of a hundred, because of those five questions they left blank, and we might wrongly think they were middling on our depression scale. To prevent us from making this rookie mistake, SPSS refuses to do the calculation when there are missing values.
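If it helps to see Jo's numbers worked through, here is a small sketch in Python (not SPSS). The function names are hypothetical, and strict_total only mimics the behaviour described above; it is not SPSS's own code.

```python
# Jo answers 10 on five questions and skips the other five (None = missing).
jo = [10, 10, 10, 10, 10, None, None, None, None, None]

def naive_total(items):
    """Add up the answers, silently treating every blank as 0."""
    return sum(v for v in items if v is not None)

def strict_total(items):
    """Mimic the behaviour described above: no total if anything is missing."""
    if any(v is None for v in items):
        return None
    return sum(items)

naive = naive_total(jo)    # 50 out of a possible 100, which looks 'middling'
strict = strict_total(jo)  # None, i.e. no total is produced at all
```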

One way around this is to calculate means instead of totals. SPSS understands that means are still meaningful (forgive the pun) even when we have some missing data. For instance, if we had taken a mean of Jo’s five answers, we would have got 10 (depressed) overall, despite the five missing values. Thus, SPSS ignores missing values and just storms ahead and calculates means whenever we ask it to. Here’s how we ask it to calculate a mean using syntax…

COMPUTE OurNewVariable = MEAN(var1, var2, var3).

Notice that we use the function MEAN() and we put our variables inside the parentheses, separated by commas. (The spaces after the commas are optional, but they do make things look nicer.) You can put as many variables inside the parentheses as you need to, so long as you put a comma between them.

If you want to do this with the graphical user interface, you just put the MEAN(var1, var2, var3) bit in the ‘Numeric Expression’ box.

You can see from the screenshot just above that SPSS has calculated a mean, even though it wouldn’t calculate a total for the second participant in my fake dataset.

It’s useful to know that SPSS still calculates a mean, but this can cause the opposite problem: it calculates a mean even when a participant has so much missing data that you really shouldn’t!

In psychology, when we have missing values on a single psychometric scale (a measure of a single construct), it’s fairly usual to prorate (or pro-rate). The usual way to do prorating, before the complete adoption of statistical software, was to take the mean of all the items on a scale (or sub-scale) that the participant did answer, and put that mean in place of each missing value. E.g. if Alex answered 1, 5, __, 3, then you’d take a mean of 1, 5, and 3. You’d calculate that it’s 3, and so you’d enter ‘3’ into the blank cell. That method works, but it’s very time consuming and it also has some drawbacks. For instance, there are some analyses (like Cronbach’s alpha) where we shouldn’t ever use prorated scores.

There’s no need to use the manual time-consuming method in SPSS because you can simply calculate the mean of a group of items even when there are missing values, and the outcome is mathematically equivalent to manual pro-rating. It’s also a heck of a lot easier and quicker and less prone to human error.
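The equivalence is easy to check by hand with Alex's answers from the example above. Here is that check as a tiny Python sketch (not SPSS); the variable names are just for illustration.

```python
# Alex answered 1, 5, (blank), 3. None marks the blank.
alex = [1, 5, None, 3]

# Shortcut: take the mean of the answers Alex did give.
valid = [v for v in alex if v is not None]
mean_of_valid = sum(valid) / len(valid)

# Manual pro-rating: write that mean into the blank, then average all four.
filled = [mean_of_valid if v is None else v for v in alex]
mean_after_filling = sum(filled) / len(filled)

# Both routes give 3.0, so the manual filling-in step adds nothing.
```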

However, SPSS will still do this even if most of the data for a participant are missing, and in that case, obviously it would make no sense. Most psychologists set a limit of 20% for missing data that can be ‘replaced’ or ‘pro-rated’. If you have more missing data, the results you’ll get from that participant are much less likely to be reliable. Pro-rating is still somewhat controversial amongst psychologists, even though many have considered it usual practice for decades. My take is that the evidence is good enough to suggest that pro-rating for even 20% missing data gives sufficiently accurate estimates of most constructs that we need not worry about it too much. See here and here.

Remember though that logically, a scale with lower internal consistency (i.e. its items don’t correlate highly amongst themselves) will produce less reliable estimates after pro-rating for missing data because items are less perfect stand-ins for each other. And indeed there is evidence from simulation studies which suggests that pro-rating, as a procedure, makes unrealistic assumptions about the structure of psychological data and may therefore produce biased estimations. If you’re completing an undergraduate dissertation, pro-rating is almost certainly enough.

If you’re long beyond undergrad level, don’t forget to read this post on assumption checks you should perform.

Pro-rating for only a certain percentage of missing data

Even if we choose the simple ‘calculate the mean for me’ approach in SPSS, a problem arises in that SPSS still calculates the mean, even when there are so many missing values for a participant that it would be silly to do so. However, you can prevent SPSS from doing this. In the Compute Variable dialogue (or in Syntax, if you’re geeky) you can add a number to the MEAN() function to tell SPSS how many valid entries there must be for a participant before it should calculate a mean.

Let’s say I’m working with a scale that has ten items, and I have decided that it’s OK to pro-rate up to 20% of my data. That’s equivalent to saying I don’t mind if participants have two answers missing from this scale, and so, I need to tell SPSS to only calculate the mean where there are 8 or more valid answers. Instead of typing MEAN() I therefore type MEAN.8().

MEAN.8(Vbl1, Vbl2, Vbl3, Vbl4, Vbl5, Vbl6, Vbl7, Vbl8, Vbl9, Vbl10)
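For anyone who wants to see the logic spelled out, here is a rough Python sketch (not SPSS) of what a MEAN.n-style rule does, along with the arithmetic that turns a 20% limit into a minimum number of valid answers. The names mean_n and min_valid_for are hypothetical, and this is my reading of the rule described above, not SPSS's actual implementation.

```python
def min_valid_for(k, max_missing_percent):
    """Smallest number of valid answers allowed on a k-item scale."""
    allowed_missing = k * max_missing_percent // 100  # rounds down
    return k - allowed_missing

def mean_n(values, min_valid):
    """Mean of the valid answers, or None if there are too few of them."""
    valid = [v for v in values if v is not None]
    if len(valid) < min_valid:
        return None
    return sum(valid) / len(valid)

threshold = min_valid_for(10, 20)  # 8, hence MEAN.8 for a ten-item scale

# Two answers missing: still within the limit, so a mean is calculated.
two_missing = [3, 4, None, 5, 2, 4, None, 3, 4, 5]
mean_two_missing = mean_n(two_missing, threshold)  # 30 / 8 = 3.75

# Three answers missing: over the limit, so no mean is calculated.
three_missing = [3, None, None, 5, 2, None, 4, 3, 4, 5]
mean_three_missing = mean_n(three_missing, threshold)  # None
```

Rounding the allowed number of missing items down is a cautious choice on my part: on a seven-item scale, 20% is 1.4 items, so only one missing answer would be tolerated.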

Last but not least, if for some reason you really wanted total scores, instead of means, you can still use the MEAN() function. You simply need to multiply by k (the number of items in the scale or sub-scale), like this…

MEAN.8(Vbl1, Vbl2, Vbl3, Vbl4, Vbl5, Vbl6, Vbl7, Vbl8, Vbl9, Vbl10) * 10
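To see why multiplying by k gives a sensible total, here is the same idea as a short Python sketch (not SPSS). The function name prorated_total is hypothetical and the answers are made up.

```python
def prorated_total(values, min_valid):
    """Mean of the valid answers multiplied by the number of items (k).

    Returns None when fewer than min_valid answers are present.
    """
    valid = [v for v in values if v is not None]
    if len(valid) < min_valid:
        return None
    return sum(valid) / len(valid) * len(values)

# Eight valid answers adding up to 30, plus two blanks, on a ten-item scale.
row = [3, 4, None, 5, 2, 4, None, 3, 4, 5]
total = prorated_total(row, 8)  # 3.75 * 10 = 37.5

# With nothing missing, the result is just the ordinary total.
complete_row = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
complete_total = prorated_total(complete_row, 8)  # 20.0
```

Simply adding up the eight valid answers would have given 30. The pro-rated total of 37.5 is what you get when each blank is treated as if it had been answered with the participant's own mean of 3.75.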