Behavioural scientists often collect data using multi-item measures of latent constructs. For instance, clinical psychologists measure anxiety and depression using self-report questionnaires composed of multiple psychometric items, whilst child psychologists measure developmental progress by asking parents batteries of questions. Missing data are extremely common on such questionnaires, and one usually finds that data are missing at the item level. In other words, participants miss out items, either by accident, or because they don’t want to answer certain questions. Thus, researchers are left with some data relating to a given construct for a particular participant, but not complete data.
The usual solution is pro-rating, where the mean of the completed items for a given scale (or sub-scale) is taken as the imputed value for any missing items on that scale. If you’re not familiar with the practice, I’ve written a post about it. As Mazza, Enders, and Ruehlman (2015) point out, this pro-rating practice is common across most of the behavioural sciences, and especially in psychology. But, as they and others also discuss, there are potentially serious issues with the practice.
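To make the arithmetic concrete, here is a minimal sketch of pro-rating in Python with pandas. The function name `prorate`, the `responses` data frame and its item columns are hypothetical and purely illustrative; this is one way the calculation could be written, not a reference implementation.

```python
import numpy as np
import pandas as pd

def prorate(items: pd.DataFrame) -> pd.DataFrame:
    """Replace each missing item with the mean of that participant's completed items."""
    # Row-wise mean over the items each participant actually answered
    person_mean = items.mean(axis=1, skipna=True)
    # Fill the gaps in every item column with the row-aligned person mean
    return items.apply(lambda col: col.fillna(person_mean))

# Hypothetical responses: three participants, one four-item scale
responses = pd.DataFrame({
    "item1": [3, 2, 4],
    "item2": [4, np.nan, 4],
    "item3": [5, 2, np.nan],
    "item4": [4, 1, 1],
})

completed = prorate(responses)
scale_total = completed.sum(axis=1)  # pro-rated scale score for each participant
```

Note that a pro-rated total is simply the participant's mean on the completed items multiplied by the number of items in the scale.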
Pro-rating makes a number of assumptions about the structure of the data. To my mind, for pro-rating to be an ideal solution, one needs to make the following assumptions:
- The domain coverage of the completed items overlaps sufficiently with the domain coverage of the missing items.
Some scales explicitly sample behaviours from a variety of related domains and therefore breach this assumption. The most obvious examples are scales for mental health conditions which map items to diagnostic criteria. Answering a question about whether you struggle to sleep is a relatively poor stand-in for an item about whether you have lost the ability to enjoy things you once did.
- The scale has good internal consistency.
The basic idea behind pro-rating is that within a scale, items are so highly correlated with each other that they can to some degree stand in for one another. The basic idea therefore falls apart if the scale has poor internal consistency (i.e. poor inter-item correlations).
- A high proportion of the items are complete and can be used to calculate the scale score.
Estimating someone’s IQ from 90% of the items in an intelligence test is one thing, but giving them only 5% of the items and then pro-rating the rest would clearly be inferior. Nobody seems to agree on what the cut-off should be, but pro-rating 20% missing data seems to be routine (Mazza, Enders, and Ruehlman, 2015). Graham (2009) suggests that proration becomes more reasonable the more items are used, and that we should never use fewer than half (!) of the scale items.
- The mean scores on the missing and completed items are similar.
Put another way, average item difficulty is reasonably well matched across the missing and completed items. See Enders (2010).
- Factor loadings or item-total correlations are similar across missing and non-missing items.
One way to think about this is that we would be introducing more error into the measurement if the missing items were those with the best loadings on the factor. Again, see Graham (2009).
Reading the above list, anyone familiar with a handful of psychometric scales will readily see that these are not always reasonable assumptions. But you will also see that it’s fairly straightforward to check these assumptions.
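As a rough illustration of how such checks might look, the sketch below computes a few descriptive statistics that bear on the assumptions listed above. Everything here is an assumption made for illustration: the function names, the `responses` data, and the choice to base alpha and the item-rest correlations on complete cases only. It does not address domain coverage, which is a judgement about item content rather than a statistic.

```python
import numpy as np
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha computed from complete cases only (no pro-rating)."""
    complete = items.dropna()
    k = complete.shape[1]
    item_vars = complete.var(axis=0, ddof=1)
    total_var = complete.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

def check_prorating_assumptions(items: pd.DataFrame) -> dict:
    """Collect descriptive statistics relevant to the pro-rating assumptions."""
    complete = items.dropna()
    total = complete.sum(axis=1)
    # Corrected item-total correlation: each item against the sum of the other items
    item_rest_r = pd.Series(
        {col: complete[col].corr(total - complete[col]) for col in items.columns}
    )
    return {
        "alpha_complete_cases": cronbach_alpha(items),            # internal consistency
        "prop_answered_per_person": items.notna().mean(axis=1),   # proportion of items completed
        "item_means": items.mean(axis=0),                         # item difficulty
        "item_missing_rate": items.isna().mean(axis=0),           # which items get skipped
        "item_rest_correlations": item_rest_r,                    # stand-in for factor loadings
    }

# Hypothetical responses: six participants, one four-item scale
responses = pd.DataFrame({
    "item1": [1, 2, 3, 4, 5, 2],
    "item2": [2, 2, 3, 5, 4, np.nan],
    "item3": [1, 3, 3, 4, 5, 3],
    "item4": [2, 1, 4, 4, np.nan, 2],
})

report = check_prorating_assumptions(responses)
```

One could then compare the means and item-rest correlations of the frequently skipped items against those of the rest, and flag any participant whose proportion answered falls below whatever cut-off has been chosen.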
Pro-rating also causes two notable side-effects, of which researchers should be mindful:
- The definition of your scale now varies across participants. It no longer has k items. Instead, the number of items contributing to each participant's score depends on how much data that participant is missing. If your data are not missing at random (because, say, someone who scores high on a certain construct is unwilling to answer some of the questions relating to that construct) you may have extra problems to worry about.
- Pro-rated data will artificially inflate estimates of internal consistency. This is because you have just created values (for participants with missing data) as a linear composite of other values for that participant. For this reason, any estimate of internal consistency should be calculated without pro-rating.
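The inflation of internal consistency can be illustrated with a small simulation. This is only a sketch under stated assumptions: the data are simulated from a single-factor model, responses are deleted completely at random, and the sample size, noise level, missingness rate and seed are all arbitrary choices of mine rather than anything taken from the literature.

```python
import numpy as np

def cronbach_alpha(x: np.ndarray) -> float:
    """Cronbach's alpha for a complete (no NaN) participants-by-items array."""
    k = x.shape[1]
    item_vars = x.var(axis=0, ddof=1)
    total_var = x.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(2024)
n, k = 5000, 6

# Simulated items: one shared latent construct plus item-specific noise
factor = rng.normal(size=(n, 1))
full = factor + rng.normal(scale=np.sqrt(2), size=(n, k))

# Delete 30% of item responses completely at random (a simulation assumption)
observed = full.copy()
observed[rng.random((n, k)) < 0.30] = np.nan

# Keep only participants who answered at least half of the items
answered = np.sum(~np.isnan(observed), axis=1)
observed = observed[answered >= k / 2]

# Pro-rate: fill each gap with that participant's mean over completed items
person_mean = np.nanmean(observed, axis=1, keepdims=True)
prorated = np.where(np.isnan(observed), person_mean, observed)

# Alpha without pro-rating (complete cases) versus alpha on pro-rated data
complete_cases = observed[~np.isnan(observed).any(axis=1)]
alpha_complete = cronbach_alpha(complete_cases)
alpha_prorated = cronbach_alpha(prorated)
```

Under these assumptions, `alpha_prorated` should come out noticeably higher than `alpha_complete`, because every imputed value is built from the same participant's other responses and so carries less item-specific noise than a real response would.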
A good many methodologists have therefore suggested that researchers should prefer other methods of dealing with missing data. These other methods are now generally possible in a range of statistical packages, but they are more computationally complex.
More importantly, at the time of going to press, they’re not universally available. SPSS and Jamovi are the two most used statistical apps in psychology. SPSS does include multiple imputation, but Jamovi does not (at least not as I write these words). Neither of them (yet) includes a proper implementation of Full Information Maximum Likelihood.
It is also worth mentioning that this method goes by a number of different names. Huisman (2000) calls it person mean substitution, Newman (2014) talks of using the mean across available items, and many psychometric companies’ manuals refer to ‘pro-rating’ (with or without the hyphen). The lack of an agreed name makes it very tricky to survey the literature on the topic.
I am not a statistician, and nor do I play one on the internet. But I am a working psychological scientist, and in my humble opinion, it is acceptable to make the commonplace and pragmatic decision to use pro-rating, so long as the above assumptions are checked, and the above warnings are taken into account. It goes without saying that these things should also be addressed in our reporting.
Enders, C. K. (2010). Applied missing data analysis. New York, NY: Guilford Press.
Huisman, M. (2000). Imputation of missing item responses: Some simple techniques. Quality and Quantity, 34, 331-351.
Mazza, G. L., Enders, C. K., & Ruehlman, L. S. (2015). Addressing item-level missing data: A comparison of proration and full information maximum likelihood estimation. Multivariate Behavioral Research, 50(5), 504-519. https://doi.org/10.1080/00273171.2015.1068157
Newman, D. A. (2014). Missing Data. Organizational Research Methods, 17(4), 372-411. https://doi.org/10.1177/1094428114548590