Computing variables and prorating for missing data in Jamovi and SPSS

Computing total scores

SPSS and Jamovi allow you to compute new variables, based on existing ones. This is really useful if, for instance, you want to create a total score for a psychometric scale or questionnaire. First let’s cover the basics, then we’ll see how to handle missing data in psychometric tests.

SPSS

In SPSS you access the Compute Variable dialogue box from the Transform menu. Let’s say that you want to simply add up three variables to make a total score. You just need to type a sensible name for your new variable in the ‘Target Variable’ box on the left, and then place each of the variables in the right-hand ‘Numeric Expression’ box, with a plus sign between them. (Just like in normal maths, the + tells SPSS to add them up.)

You might find that the SPSS Output window pops up, showing you the code it ran in the background when you clicked ‘OK’, but you can ignore that. Go back to your dataset. At the far right of your dataset, you’ll find a new column with the name you just chose, containing the result of your calculation.
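
For reference, the code that appears in the Output window should look roughly like the sketch below. The variable names (item_1, item_2, item_3 and total_score) are made up for illustration, so swap in your own.

```spss
* Add three items together to make a total score.
* item_1, item_2, item_3 and total_score are hypothetical names.
COMPUTE total_score = item_1 + item_2 + item_3.
EXECUTE.
```

You could paste something like this into a Syntax window and run it, instead of using the dialogue box.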

Jamovi

The process in Jamovi is very similar. In the Data tab, click Compute, choose a name for your new variable and put it in the first field, add a description to help you remember what the variable measures, and then type a formula adding up the variables you need.

Prorating for missing data: the concept

The above approach works great for many things, but it does have one drawback. What if you have missing data? For instance, often when we run a survey study, some participants don’t want to answer certain questions, or they are having a careless moment and just skip over one or two questions.

If you have a missing value amongst the values you’re trying to add up, SPSS and Jamovi will refuse to add them up, because, after all, it wouldn’t be a true answer for any participant who has a missing value.

Why? Well, imagine that we’re adding up 10 questions on a depression scale and that each question might get an answer from 0 (happy) to 10 (depressed). What if Jo Bloggs answers a 10 for half the questions, but then fails to answer the rest? As a human being looking at these scores we might think that Jo is very depressed (maximum score on the ones they did answer) and perhaps even so distressed they couldn’t bring themselves to answer the rest of the questions. If you just added up a total score, though, Jo would get 50 out of a hundred, because of those five questions they left blank, and we might wrongly think they were middling on our depression scale. To prevent us from making this rookie mistake, SPSS and Jamovi refuse to do the calculation when there are missing values.

One way around this is to calculate means instead of totals, and then, if we like, multiply the mean score back up to get the pro-rated total score.

⚠️ Be sure to read this post on assumption checks you should perform, especially if you’re using pro-rating for anything beyond masters level psychometrics.

Means are still meaningful (forgive the pun) even when we have some missing data. For instance, let’s say we measure Jo’s depression on a number of different days. Being very depressed, they always answer the maximum score of 10 for each question. One day they might answer 9 of our questions. Their total score would be 90, but the mean score (ignoring missing data) is 10. The next day they only answer three questions. Their total score would be 30, but again if we ignore missing data, their mean is 10, because they scored 10 on all the questions they did answer. The mean of Jo’s responses on each of the items is a more accurate measure regardless of how many questions they don’t answer. The mean always reflects their highly depressed state.

Now obviously, there are limits to this. We could still take this approach even if a participant only answered 2 questions on a given scale. But of course there’s a reason most scales have more than just one or two questions. Psychometricians aim to create scales that measure all the important aspects of a given construct. Missing out some questions means we might no longer be measuring all those aspects. Also, scales become less reliable as they get shorter (cf. the Spearman-Brown formula if you’re interested).

As a result, if you have more missing data, the results you’ll get from that participant are much less likely to be reliable. Most psychologists set a limit of 20% for missing data that can be ‘replaced’ or ‘pro-rated’. Pro-rating is still somewhat controversial amongst psychologists, even though many have considered it usual practice for decades and it is commonly used in a number of high-stakes assessments such as IQ tests. My take is that the evidence is good enough to suggest that pro-rating for even 20% missing data gives sufficiently accurate estimates of most constructs that we need not worry about it too much. See here and here.

Remember though that logically, a scale with lower internal consistency (i.e. its items don’t correlate highly amongst themselves) will produce less reliable estimates after pro-rating for missing data because items are less perfect stand-ins for each other. And indeed there is evidence from simulation studies which suggests that pro-rating, as a procedure, makes unrealistic assumptions about the structure of psychological data and may therefore produce biased estimations. If you’re completing an undergraduate or MSc dissertation in psychology, pro-rating is almost certainly enough.

If you’re long beyond undergrad level, don’t forget to read this post on assumption checks you should perform.

The simplest way to prorate for missing data is to use the MEAN() function in SPSS or Jamovi. Both packages will ignore missing data and calculate a mean score anyway. Note that we’re not calculating a mean for the sample here. We’re calculating the average item response for a single participant.

In various textbooks you might see another approach to pro-rating where you’re told to manually calculate the mean. Say Jo scored 1, 3, 5, 2, 4, _, 7 (where _ is a question Jo didn’t answer), then these textbooks will say you should calculate the mean of 1, 3, 5, 2, 4, 7 and put that answer (3.67) into the _ space. This works too, but it requires you to calculate dozens, maybe even hundreds of scores by hand. And it’s mathematically equivalent to the approach shown here, where Jamovi or SPSS do the work for you.
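
To see why the two approaches agree, here is the arithmetic for Jo’s seven-item example written out as a quick check (values rounded to two decimal places):

```latex
\text{mean of answered items} = \frac{1 + 3 + 5 + 2 + 4 + 7}{6} = \frac{22}{6} \approx 3.67

\text{fill-in-the-blanks total} = 22 + \frac{22}{6} = \frac{154}{6} \approx 25.67

\text{mean} \times \text{number of items} = \frac{22}{6} \times 7 = \frac{154}{6} \approx 25.67
```

Either way, Jo ends up with the same pro-rated total.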

The ‘fill in the blanks’ method just described also has another huge problem. There are some analyses (like Cronbach’s alpha) where we shouldn’t ever use prorated scores. If you fill in the blanks using manually calculated mean scores, you’ll no longer know which answers were actually missing. You wouldn’t be able to do Cronbach’s alpha and other such calculations on your data.

Prorating in SPSS

There is a Jamovi equivalent to this section, just below.

The basic idea is simple. Using the same Transform > Compute option from the menu bar, we ask SPSS to calculate the mean average of the items for the scale, rather than a sum total. Shown below.

Notice that we use the function MEAN() and we put our variables inside the parentheses, separated by commas. (The spaces after the commas are optional, but they do make things look nicer.) You can put as many variables inside the parentheses as you need to, so long as you put a comma between them.

SPSS ignores missing values and just storms ahead and calculates means whenever we ask it to. We can do the same thing directly from syntax, if we prefer:

COMPUTE aaq_mean_item_sc = MEAN(aaq_1, aaq_2, aaq_3, aaq_4, aaq_5, aaq_6, aaq_7).
EXECUTE.
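
If the seven items sit next to each other in your dataset, in order, SPSS also lets you write the list in shorthand using the TO keyword. This is only a sketch, and it assumes nothing else sits between aaq_1 and aaq_7 in the file. If other variables are in between, they would be pulled into the mean too.

```spss
* Shorthand for the same mean, assuming aaq_1 to aaq_7 are adjacent columns.
COMPUTE aaq_mean_item_sc = MEAN(aaq_1 TO aaq_7).
EXECUTE.
```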

But as I noted above, psychologists usually set a limit on the number of missing data points that can be prorated. The usual cut-off is 20% of the items on a scale. For our example, we have 7 items in our scale, so 20% of 7 is 1.4, and we round down to set our limit at 1 missing item. In other words, people still have to have answered 6 of the 7 questions on our scale.
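
Here is the same arithmetic written out, along with a second example for a hypothetical 10-item scale (not one used in this post):

```latex
\text{allowed missing} = \lfloor 0.20 \times 7 \rfloor = \lfloor 1.4 \rfloor = 1
\qquad
\text{minimum answered} = 7 - 1 = 6

\text{allowed missing} = \lfloor 0.20 \times 10 \rfloor = \lfloor 2 \rfloor = 2
\qquad
\text{minimum answered} = 10 - 2 = 8
```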

In the Compute Variable dialogue (or in Syntax) you can add a full stop and a number straight after MEAN (so MEAN.6 rather than MEAN) to tell SPSS how many valid entries a participant must have for it to calculate a mean. Anyone with fewer valid answers gets a missing value instead.

COMPUTE aaq_mean_item_sc = MEAN.6(aaq_1, aaq_2, aaq_3, aaq_4, aaq_5, aaq_6, aaq_7).
EXECUTE.
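
If you want to check who ended up without a score, one option is to count each participant’s missing items with NMISS(). This is a sketch rather than part of the procedure above, and aaq_n_missing is a made-up name.

```spss
* Count how many of the seven items each participant left blank.
* aaq_n_missing is a hypothetical variable name.
COMPUTE aaq_n_missing = NMISS(aaq_1, aaq_2, aaq_3, aaq_4, aaq_5, aaq_6, aaq_7).
EXECUTE.

* Tabulate how many participants have each number of missing items.
FREQUENCIES VARIABLES=aaq_n_missing.
```

With seven items and a minimum of six valid answers, anyone with a count of 2 or more will have a missing value on the mean score.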

The final thing to consider is whether you actually do need total scores, and their prorated equivalent. For instance, you might be working with a scale that has clinical cut-offs or other norms, e.g. “A score above 32 means significant impairment”. In such cases, our mean scores would be much lower than totals, so we need to scale them back up to be equivalent to totals. It is easily done.

The mean of some numbers is just the total of those numbers divided by how many there were. So to get back from the mean to the total, we just do the opposite — instead of dividing we multiply by how many there were:

\text{MEAN}(A, B, C) = \frac{A + B + C}{\text{number of items}}

We can rearrange:

\text{MEAN}(A, B, C) \times \text{number of items} = A + B + C

The multiplication (by however many items are in your scale) can just be tagged on to the end of the compute command:

COMPUTE aaq_mean_item_sc = MEAN.6(aaq_1, aaq_2, aaq_3, aaq_4, aaq_5, aaq_6, aaq_7) * 7.
EXECUTE.
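
Once you have a pro-rated total, you can compare it against a cut-off. The sketch below borrows the “above 32” figure from the example quote purely as an illustration (check your own scale’s manual for the real threshold), and aaq_prorated_total and aaq_above_cutoff are made-up names.

```spss
* Store the pro-rated total under its own hypothetical name.
* This avoids overwriting the mean item score computed earlier.
COMPUTE aaq_prorated_total = MEAN.6(aaq_1, aaq_2, aaq_3, aaq_4, aaq_5, aaq_6, aaq_7) * 7.

* Flag is 1 if the total is above the illustrative cut-off of 32, otherwise 0.
* Participants with no pro-rated total stay missing on the flag.
COMPUTE aaq_above_cutoff = (aaq_prorated_total > 32).
EXECUTE.
```

Pro-rated totals are often not whole numbers, which is fine for a comparison like this.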

Prorating in Jamovi

This is the Jamovi equivalent of the SPSS section just above.

The basic idea is simple. Using the same Compute button from the menu bar, we ask Jamovi to calculate the mean average of the items for the scale, rather than a sum total. Shown below.

Notice that we use the function MEAN() and we put our variables inside the parentheses, separated by commas. (The spaces after the commas are optional, but they do make things look nicer.) You can put as many variables inside the parentheses as you need to, so long as you put a comma between them.

Jamovi ignores missing values and just storms ahead and calculates means whenever we ask it to.

But as I noted above, psychologists usually set a limit on the number of missing data points that can be prorated. The usual cut-off is 20% of the items on a scale. For our example, we have 7 items in our scale, so 20% of 7 is 1.4, and we round down to set our limit at 1 missing item. In other words, people still have to have answered 6 of the 7 questions on our scale.

In the Computed Variable box you can add a number to the MEAN() function to tell Jamovi how many valid entries there must be for a participant for it to calculate a mean. We do this by attaching the number to a ‘flag’ inside the function like this:

The final thing to consider is whether you actually do need total scores, and their prorated equivalent. For instance, you might be working with a scale that has clinical cut-offs or other norms, e.g. “A score above 32 means significant impairment”. In such cases, our mean scores would be much lower than totals, so we need to scale them back up to be equivalent to totals. It is easily done.

The mean of some numbers is just the total of those numbers divided by how many there were. So to get back from the mean to the total, we just do the opposite — instead of dividing we multiply by how many there were:

\text{MEAN}(A, B, C) = \frac{A + B + C}{\text{number of items}}

We can rearrange:

\text{MEAN}(A, B, C) \times \text{number of items} = A + B + C

The multiplication (by however many items are in your scale) can just be tagged on to the end of the compute command:
