Missing Data/Imputation Discussion > What is the right way to impute more variables?
How come no one has replied? Doesn't my question make sense?
Sorry for the delay in my response, Alyssa. For the most part, I'm the sole responder to these questions, so sometimes I can't get them answered as quickly as I'd like. As for your question, it's a great one, but also a difficult one. In general, I prefer to impute at the model level, as opposed to the data-set level. Sometimes I'll plan a few models ahead and impute the variables for them together, but that isn't the same as imputing every variable in the data set regardless of whether you're going to use it in a model.
I would be especially hesitant to impute an entire data set when the rates of missingness are as high as you describe, because your imputations are likely to contain A LOT of noise, which isn't going to help your ability to test hypotheses and has the potential to inflate the standard errors in your models.
As for the advice you got from the statistician, I can't speak to whether it is valid or not, because I'm not familiar with the procedure you describe. Most of the recent multiple imputation literature that I've read indicates that only 3 to 5 imputed data sets are needed, with anything beyond that adding little to the accuracy of your models. Keep in mind that hundreds or even thousands of iterations are typically used to create each data set, to be sure that each one is robust.
I typically prefer to use the software R to do multiple imputation, and the "mi" package specifically. The package's author provides great documentation walking through the procedure. Here is a link to its CRAN page:
http://cran.r-project.org/package=mi
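To give you a feel for the workflow, here is a minimal sketch using the newer mi interface as I recall it from that documentation (dat is a made-up placeholder for a small data frame holding just the variables for one model):

library(mi)

# dat: hypothetical data frame with only the variables for one model
mdf <- missing_data.frame(dat)
show(mdf)  # check that each variable's type was guessed correctly

# Each chain yields one completed data set, so n.chains = 4 gives 4
# imputed data sets; raise n.iter if the chains haven't converged
imputations <- mi(mdf, n.chains = 4, n.iter = 30)
Rhats(imputations)  # convergence diagnostic; values near 1 are good

# Extract the completed data sets for analysis
completed <- complete(imputations, m = 4)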
I hope this is helpful, and again, sorry for the delay of a few days.
The literature says that when a researcher creates an MI model, they can choose to impute for a specific analysis/model or for an entire data set. I would like to impute for the whole data set; however, I have 120 participants, 150 variables, 5 time points, and up to 50% missing data for some variables (20% for others). I have data missing due to randomly omitted items, skipped questionnaires, and refused interviews. Long story short, when I use SAS, I cannot fit all of my variables into the model (and yes, I dummy-code them appropriately). My question is: in my situation, is it possible (and a good strategy) to impute for the entire data set? And if so, how?
A statistician I know suggested a method for including more items in the MI model: first, you fit as many variables into the model as will fit and impute (e.g., creating 25 data sets). Next, you merge those imputed values back into the original data set (dropping the variables that were just imputed), rerun the model, and this time include more variables. I think you set the number of imputations to 1 on this pass, because you already have 25 data sets. You continue the merging and rerunning of the MI model until you have imputed all remaining variables of interest (with the number of imputations still set to 1).
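To make sure I'm describing it clearly, in rough R terms (using the mice package purely for illustration, since I can't easily write out the SAS steps; full_data, block1_vars, and extra_vars are made-up placeholders), I understand the suggestion to look something like this:

library(mice)

# Hypothetical split: block1_vars fit in the MI model, extra_vars don't
# Step 1: impute the first block, producing m = 25 completed data sets
imp1 <- mice(full_data[, block1_vars], m = 25, printFlag = FALSE)

# Step 2: merge each completed block back with the remaining variables
# and re-impute with m = 1 to fill in the extra block
completed_sets <- lapply(1:25, function(k) {
  merged <- cbind(complete(imp1, k),        # block 1, now complete
                  full_data[, extra_vars])  # block 2, still missing
  imp_k <- mice(merged, m = 1, printFlag = FALSE)
  complete(imp_k, 1)
})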
Do you think this method could be used to impute the entire data set? I'm uncomfortable with the drawbacks of running only 25 imputations on the first pass and setting the remainder to 1, but I don't know enough to reason about whether this strategy is sound. I would very much value your thoughts on how to impute the whole data set!
Thank you!