Computing variables and pro-rating in SPSS

The basics

SPSS allows you to compute new variables, based on existing ones. This is really useful if, for instance, you want to create a total score for a psychometric scale or other questionnaire.

⚠️ Be sure to read this post on assumption checks you should perform.

You access the Compute Variable dialogue box from the Transform menu…

Here’s what each part of the window is for…

Now let’s say that you want to simply add up three variables to make a total score. You just need to type a sensible name in the ‘Target Variable’ box on the left, and then place each of the variables in the right-hand ‘Numeric Expression’ box, with a plus sign between them. (Just like in normal maths, the + tells SPSS to add them up.)

You might find that the SPSS Output window pops up, showing you the code it ran in the background when you clicked ‘OK’, but you can ignore that. Go back to your dataset. At the end, on the right, you’ll find a new column with the name you just created, containing the result of your calculation.
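
For reference, the code that appears in the Output window should look something like the sketch below. The names var1, var2 and var3 stand in for whichever variables you picked, and TotalScore is just a made-up name for the new variable, so yours will differ.

COMPUTE TotalScore = var1 + var2 + var3.
EXECUTE.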

Pro-rating for missing data

This works great for many things, but it does have one drawback. What if you have missing data? For instance, often when we run a survey study, some participants don’t want to answer certain questions, or they are having a careless moment and just skip over one or two questions.

If you have a missing value amongst the values you’re trying to add up, SPSS will refuse to add them up and will leave the new variable blank for that participant, because, after all, a total with an answer missing wouldn’t be a true total.

Why does SPSS do this? Well, imagine that we’re adding up 10 questions on a depression scale and that each question might get an answer from 0 (happy) to 10 (depressed). What if Jo Bloggs answers 10 for half the questions, but then fails to answer the rest? As a human being looking at these scores, we might think that Jo is very depressed (maximum score on the ones they did answer) and perhaps even so distressed they couldn’t bring themselves to answer the rest of the questions. If you just added up a total score, though, Jo would get 50 out of 100, because of those five questions they left blank, and we might wrongly think they were middling on our depression scale. To prevent us from making this rookie mistake, SPSS refuses to do the calculation when there are missing values.
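
If you’d like to see this for yourself, here is a small sketch you could paste into a Syntax window. Everything in it is made up for illustration: the variable names q1 to q10, the names of the two new variables, and the use of -9 as a code for ‘no answer’. It also uses the SUM() function, which isn’t covered in this post, purely to show what ‘just adding up’ would give for Jo.

* Jo's made-up answers, where -9 is a hypothetical code for no answer.
DATA LIST FREE / q1 q2 q3 q4 q5 q6 q7 q8 q9 q10.
BEGIN DATA
10 10 10 10 10 -9 -9 -9 -9 -9
END DATA.
MISSING VALUES q1 TO q10 (-9).
* The plus sign gives a missing result as soon as any one item is missing.
COMPUTE TotalPlus = q1 + q2 + q3 + q4 + q5 + q6 + q7 + q8 + q9 + q10.
* SUM() adds up whatever valid answers there are.
COMPUTE TotalSum = SUM(q1 TO q10).
LIST.

The listing should show TotalPlus as blank for Jo and TotalSum as 50, which is the misleading ‘middling’ score described above.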

One way around this is to calculate means instead of totals. SPSS understands that means are still meaningful (forgive the pun) even when we have some missing data. For instance, if we had taken a mean of Jo’s five answers, we would have got 10 (depressed) overall, despite the five missing values. Thus, SPSS ignores missing values and just storms ahead and calculates means whenever we ask it to. Here’s how we ask it to calculate a mean using syntax…

COMPUTE OurNewVariable = MEAN(var1, var2, var3).
EXECUTE.

Notice that we use the function MEAN() and we put our variables inside the parentheses, separated by commas. (The spaces after the commas are optional, but they do make things look nicer.) You can put as many variables inside the parentheses as you need to, so long as you put a comma between them.

If you want to do this with the graphical user interface, you just put the MEAN(var1, var2, var3) bit in the ‘Numeric Expression’ box.

You can see from the screenshot just above that SPSS has calculated a mean, even though it wouldn’t calculate a total for the second participant in my fake dataset.

Even though it’s useful to know that SPSS does still calculate a mean, this can cause the opposite problem: it calculates a mean even when a participant has so much missing data that you really shouldn’t be calculating one!

In psychology, when we have missing values on a single psychometric scale (a measure of a single construct), it’s fairly usual to prorate (or pro-rate). The usual way to do prorating, before the complete adoption of statistical software, was to take the mean of all the items on a scale (or sub-scale) that the participant did answer, and put that mean in place of each missing value. E.g. if Alex answered 1, 5, __, 3, then you’d take a mean of 1, 5, and 3. You’d calculate that it’s 3, and so you’d enter ‘3’ into the blank cell. That method works, but it’s very time-consuming and it also has some drawbacks. For instance, there are some analyses (like Cronbach’s alpha) where we shouldn’t ever use prorated scores.

There’s no need to use the manual time-consuming method in SPSS because you can simply calculate the mean of a group of items even when there are missing values, and the outcome is mathematically equivalent to manual pro-rating. It’s also a heck of a lot easier and quicker and less prone to human error.
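
To see why the two are equivalent, take Alex’s answers from above. Manual pro-rating fills the blank with 3 and then gives a mean of (1 + 5 + 3 + 3) / 4 = 3. MEAN() simply averages the three valid answers: (1 + 5 + 3) / 3 = 3. Here is a rough sketch of the same thing in syntax. The names item1 to item4 and ScaleMean are invented for this example, as is the -9 code for ‘no answer’.

* Alex's answers, where -9 is a hypothetical code for no answer.
DATA LIST FREE / item1 item2 item3 item4.
BEGIN DATA
1 5 -9 3
END DATA.
MISSING VALUES item1 TO item4 (-9).
* MEAN() skips the missing item and averages the other three.
COMPUTE ScaleMean = MEAN(item1, item2, item3, item4).
LIST.

ScaleMean should come out as 3, the same value you would get after filling in the blank by hand.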

However, SPSS will still do this even if most of the data for a participant are missing, and in that case, obviously it would make no sense. Most psychologists set a limit of 20% for missing data that can be ‘replaced’ or ‘pro-rated’. If you have more missing data, the results you’ll get from that participant are much less likely to be reliable. Pro-rating is still somewhat controversial amongst psychologists, even though many have considered it usual practice for decades. My take is that the evidence is good enough to suggest that pro-rating for even 20% missing data gives sufficiently accurate estimates of most constructs that we need not worry about it too much. See here and here.

Remember though that logically, a scale with lower internal consistency (i.e. its items don’t correlate highly amongst themselves) will produce less reliable estimates after pro-rating for missing data because items are less perfect stand-ins for each other. And indeed there is evidence from simulation studies which suggests that pro-rating, as a procedure, makes unrealistic assumptions about the structure of psychological data and may therefore produce biased estimations. If you’re completing an undergraduate dissertation, pro-rating is almost certainly enough.

If you’re long beyond undergrad level, don’t forget to read this post on assumption checks you should perform.

Pro-rating for only a certain percentage of missing data

Even if we choose the simple ‘calculate the mean for me’ approach in SPSS, a problem arises in that SPSS still calculates the mean, even when there are so many missing values for a participant that it would be silly to do so. However, you can prevent SPSS from doing this. In the Compute Variable dialogue (or in Syntax, if you’re geeky) you can add a number to the MEAN() function to tell SPSS how many valid entries there must be for a participant before it should calculate a mean.

Let’s say I’m working with a scale that has ten items, and I have decided that it’s OK to pro-rate up to 20% of my data. That’s equivalent to saying I don’t mind if participants have two answers missing from this scale, and so, I need to tell SPSS to only calculate the mean where there are 8 or more valid answers. Instead of typing MEAN() I therefore type MEAN.8().

MEAN.8(Vbl1, Vbl2, Vbl3, Vbl4, Vbl5, Vbl6, Vbl7, Vbl8, Vbl9, Vbl10)
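
If you’re working in a Syntax window rather than the dialogue box, the expression needs to sit inside a COMPUTE command. A minimal sketch follows. ScaleMean is a made-up name, and the TO shorthand assumes that Vbl1 to Vbl10 sit next to each other, in that order, in your dataset. If they don’t, list them out with commas instead.

COMPUTE ScaleMean = MEAN.8(Vbl1 TO Vbl10).
EXECUTE.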

Last but not least, if for some reason you really wanted total scores, instead of means, you can still use the MEAN() function. You simply need to multiply by k (the number of items in the scale or sub-scale), like this…

MEAN.8(Vbl1, Vbl2, Vbl3, Vbl4, Vbl5, Vbl6, Vbl7, Vbl8, Vbl9, Vbl10) * 10
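
If you want to check that the cut-off behaves as you expect, here is a small sketch with three made-up participants who answer 3 to every question they don’t skip. The -9 ‘no answer’ code and the names nMissing and ProratedTotal are assumptions for this example only. NMISS() simply counts how many of the listed items are missing for each participant.

* Three made-up participants, where -9 is a hypothetical code for no answer.
DATA LIST FREE / Vbl1 Vbl2 Vbl3 Vbl4 Vbl5 Vbl6 Vbl7 Vbl8 Vbl9 Vbl10.
BEGIN DATA
3 3 3 3 3 3 3 3 3 3
3 3 3 3 3 3 3 3 -9 -9
3 3 3 3 3 3 3 -9 -9 -9
END DATA.
MISSING VALUES Vbl1 TO Vbl10 (-9).
* Count the missing items for each participant.
COMPUTE nMissing = NMISS(Vbl1 TO Vbl10).
* Pro-rated total, calculated only where at least 8 items are valid.
COMPUTE ProratedTotal = MEAN.8(Vbl1 TO Vbl10) * 10.
LIST.

The first two participants (none missing and two missing) should both get a pro-rated total of 30. The third has three items missing, which is over the 20% limit, so ProratedTotal should be left blank.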