Missing Data/Imputation Discussion > Impact of multiple imputation on correlations
Good Afternoon Tina!
Thank you for your wonderful (and thorough) questions! I'm not sure I can answer them with adequate depth and breadth through this medium; I'd recommend checking out my "Google+" stats hangouts, which give you the opportunity to chat directly and ask these types of questions. With that said, I'll do my best:
Question 1: I can't definitively say whether MI would be appropriate for your project without thoroughly reviewing your data and design. However, Monte Carlo (simulation) studies have shown MI to be helpful in reducing bias from missing data even when as much as 50% of the data are missing, so a large amount of missing data is not, by itself, a reason to dismiss MI.
Question 2: I'm not aware of a general tendency for imputed data to show lower correlation coefficients than the original data, so I would guess this is related to your imputation model. It may be a sign that important predictors of your variables, or of their missingness, are absent from the imputation model.
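Since you are working with the mice package in R, here is a minimal sketch of how auxiliary predictors of missingness might be brought into the imputation model. The data frame dat and the names var1, var2, age, and clinic are placeholders, not your actual variables:

```r
library(mice)

# Sketch: add auxiliary predictors of missingness to the imputation model.
# "dat" is a placeholder data frame; "var1" and "var2" are the incomplete
# analysis variables, and "age"/"clinic" stand in for auxiliary variables
# thought to predict missingness.
ini  <- mice(dat, maxit = 0)   # dry run: returns the default setup, no iterations
pred <- ini$predictorMatrix

# Rows are the variables being imputed, columns the candidate predictors;
# 1 = use as predictor, 0 = do not use.
pred[c("var1", "var2"), c("age", "clinic")] <- 1

imp <- mice(dat, predictorMatrix = pred, m = 20, seed = 123)
```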
Question 3: Again, I'm not aware of this as a general tendency of imputed data, so I would consider which variables might be missing from your imputation model that could produce this effect.
Question 4: I would recommend focusing on how you might improve your imputation model so as to minimize the bias in your results.
Question 5: It is possible to include "too many" variables in your imputation model if a large proportion of them are not substantially related to your incomplete variables and/or to their likelihood of being missing.
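If helpful, mice's quickpred() function offers one way to prune a large candidate set down to predictors that clear a minimal correlation threshold and a minimal proportion of usable cases. A rough sketch, with arbitrary thresholds and a placeholder data frame dat:

```r
library(mice)

# With ~120 candidate predictors and only 190 observations, quickpred()
# keeps, for each incomplete variable, only predictors that correlate at
# least "mincor" with the variable or its missingness indicator and that
# have at least "minpuc" usable cases. Thresholds here are arbitrary.
pred <- quickpred(dat, mincor = 0.2, minpuc = 0.25)

rowSums(pred)   # number of predictors retained for each imputed variable

imp <- mice(dat, predictorMatrix = pred, m = 20, seed = 123)
```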
I hope this has been helpful and best of luck!
Dear all,
I have been attempting to use multiple imputation (MI) to handle missing data in my study. I use the mice package in R for this. The deeper I get into this process, the more I realize I first need to understand some basic concepts which I hope you can help me with.
For example, let us consider two arbitrary variables in my study that have the following missingness pattern:
Variable 1 available, Variable 2 available: 51 (of 118 observations, 43%)
Variable 1 available, Variable 2 missing: 37 (31.3%)
Variable 1 missing, Variable 2 available: 10 (8.4%)
Variable 1 missing, Variable 2 missing: 20 (16.9%)
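(For reference, a pattern count like this can be tabulated with mice's md.pattern(); in the sketch below, dat, var1, and var2 stand in for my actual data.)

```r
library(mice)

# Tabulate the joint missingness patterns of the two example variables.
md.pattern(dat[, c("var1", "var2")])

colMeans(is.na(dat[, c("var1", "var2")]))  # fraction missing per variable
```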
I am interested in the correlation between Variable 1 and Variable 2.
Q1. Does it even make sense for me to use MI (or anything else, really) to replace my missing data when such large fractions are not available?
Plot 1 (http://imgur.com/KFV9y&CmV1sl) provides a scatter plot of these example variables in the original data. The correlation coefficient is r = -0.34 (p = 0.016).
Q2. I notice that correlations between variables in imputed data (pooled estimates over all imputations) are much lower and less significant than the correlations in the original data. For this example, the pooled estimates for the imputed data show r = -0.11 and p = 0.22.
Since this seems to happen in all the variable combinations that I have looked at, I would like to know if MI is known to have this behavior, or whether this is specific to my imputation.
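For concreteness, the kind of pooling I mean looks roughly like the sketch below (not my exact code): the correlation is computed in each completed data set, Fisher-z-transformed, and combined with Rubin's rules via pool.scalar(). Here imp, var1, and var2 are placeholders.

```r
library(mice)

# Pool the correlation between var1 and var2 over the m imputed data sets
# using Fisher's z transformation and Rubin's rules.
m <- imp$m
z <- se2 <- numeric(m)
for (i in seq_len(m)) {
  d <- complete(imp, i)
  n <- nrow(d)
  z[i]   <- atanh(cor(d$var1, d$var2))  # Fisher z transform of r
  se2[i] <- 1 / (n - 3)                 # approximate sampling variance of z
}
pooled <- pool.scalar(Q = z, U = se2, n = n)
tanh(pooled$qbar)   # pooled z transformed back to a correlation
```

(I believe the miceadds package also provides micombine.cor() as a ready-made version of this.)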
Q3. When going through the imputations, the distributions of the individual variables (min, max, mean, etc.) match the original data. However, correlations and least-squares line fits vary quite a bit from imputation to imputation (see Plot 2, http://imgur.com/KFV9yl&CmV1s). Is this normal?
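To illustrate the variation I mean, here is a sketch (again with imp, var1, and var2 as placeholders) that fits the least-squares line in each imputed data set with with() and pools the results with pool():

```r
library(mice)

# Fit the regression in each completed data set and pool with Rubin's rules.
fit <- with(imp, lm(var2 ~ var1))

# Slopes from the individual imputations, to see how much they spread:
sapply(fit$analyses, function(f) coef(f)[["var1"]])

# Pooled inference; fmi (fraction of missing information) quantifies how
# much the imputations disagree relative to the total uncertainty.
pooled <- pool(fit)
summary(pooled)
pooled$pooled$fmi
```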
Q4. Since my results differ (quite significantly) between the original and imputed data, which one should I trust?
Q5. I have included many variables (up to 120) in my imputation model, while I only have 190 observations (of which 118 are boys, as in the example). Could these be too many variables to make valid predictions of the imputed data? How many predictors can I include in my imputation model?
Thank you for your help in advance.
Tina