Regression Discussion > Degrees of freedom problem

I'm trying to build a predictor model for college enrollment. The long cycle in annual data and the likelihood of structural changes in the model limit me to exceedingly few observations -- probably 10. In truth, there are FAR more than 10 relevant independent variables for this model, and I'm struggling to come up with a method for building a robust predictor without the necessary degrees of freedom. I hit upon an idea that can't possibly work; I'd like to know (a) what's wrong with my thinking, and (b) how I actually SHOULD proceed.
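To make the constraint concrete, here is a minimal sketch in Python (NumPy) on purely made-up data. The count of 9 predictors plus an intercept is an assumption chosen only to match the 10 observations mentioned above; none of the numbers come from real enrollment figures. It illustrates that once the number of estimated coefficients reaches the number of observations, ordinary least squares reproduces the data exactly and leaves nothing with which to estimate error.

```python
import numpy as np

rng = np.random.default_rng(0)

n_obs = 10         # roughly ten usable annual observations
n_predictors = 9   # assumed count; with the intercept this gives 10 coefficients

# Made-up design matrix (intercept + random predictors) and a made-up response.
# The response is pure noise, so there is no real relationship to find.
X = np.column_stack([np.ones(n_obs), rng.normal(size=(n_obs, n_predictors))])
y = rng.normal(size=n_obs)

# Ordinary least squares fit.
coef, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)

residual_df = n_obs - rank                  # degrees of freedom left for error
rss = float(np.sum((y - X @ coef) ** 2))    # residual sum of squares

print("residual degrees of freedom:", residual_df)
print("residual sum of squares:", rss)
```

The fit is essentially perfect even though the response is noise, which is why a single regression on all the relevant variables is not an option with this few observations.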

My thought was to break up total enrollment into enrollment by category (high school seniors, unemployed locals, military veterans, senior citizens, etc). Each CATEGORY would then have 10 observations, and I could build an extremely simple univariate or bivariate model to loosely predict those enrollments. THEN, I could add up the predicted enrollments by category, find the percentage deviation from the actual enrollment, and try to predict THAT using an additional univariate or bivariate model.

In other words, my process would look something like this:

1. Use a simple regression to predict the number of students who would enroll by category.
2. Add the totals to get the Total Predicted Enrollment.
3. Find out how far Total Predicted Enrollment deviates from Total Actual Enrollment (as a percentage of the predicted value).
4. Run a regression on separate variables to predict this (constructed) percentage deviation.
5. For forecasting, I could then use my simple regressions to forecast categories and my aggregate deviation to determine my "fudge factor" for either increasing or decreasing my overall prediction.
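
The five steps above can be sketched as follows in Python (NumPy). Everything here is a hypothetical illustration on synthetic data: the category names, the one "driver" variable per category, the separate variable `z`, the uncategorized `other` enrollment, and the `forecast` helper are all assumptions made for the sketch, not real data or an endorsed method.

```python
import numpy as np

rng = np.random.default_rng(42)
n_years = 10
categories = ["hs_seniors", "unemployed_locals", "veterans"]  # hypothetical

# Synthetic data: one made-up driver per category, and enrollment that
# depends on it linearly plus noise.
drivers = {c: rng.uniform(50, 150, size=n_years) for c in categories}
enroll = {c: 20 + 0.8 * drivers[c] + rng.normal(0, 5, size=n_years)
          for c in categories}

# A separate made-up variable and some uncategorized enrollment, so that the
# category predictions do not add up to the actual total.
z = rng.uniform(0, 1, size=n_years)
other = 30 + 40 * z + rng.normal(0, 3, size=n_years)
actual_total = sum(enroll.values()) + other

# Step 1: a simple univariate regression (slope, intercept) per category.
fits = {c: np.polyfit(drivers[c], enroll[c], 1) for c in categories}
pred_by_cat = {c: np.polyval(fits[c], drivers[c]) for c in categories}

# Step 2: add the category predictions to get Total Predicted Enrollment.
predicted_total = sum(pred_by_cat.values())

# Step 3: percentage deviation of actual from predicted (relative to predicted).
pct_dev = (actual_total - predicted_total) / predicted_total

# Step 4: regress the constructed deviation on the separate variable.
dev_fit = np.polyfit(z, pct_dev, 1)

# Step 5: forecast the categories, sum them, then apply the "fudge factor".
def forecast(new_drivers, new_z):
    base = sum(np.polyval(fits[c], new_drivers[c]) for c in categories)
    return base * (1 + np.polyval(dev_fit, new_z))

# Coefficients estimated across both stages from the same 10 years.
n_params = 2 * len(categories) + 2

new_drivers = {c: 100.0 for c in categories}
base_only = sum(np.polyval(fits[c], new_drivers[c]) for c in categories)
adjusted = forecast(new_drivers, 0.5)
print("coefficients estimated in total:", n_params)
print("sum of category forecasts:", base_only)
print("forecast after fudge factor:", adjusted)
```

Each individual regression here uses only two coefficients, but the final forecast depends on all of them at once, and the stage-two response is itself built from stage-one fitted values. That nesting is what makes the error bands hard to state.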

Intuitively, this process feels all wrong -- I feel like I'm running some kind of degrees-of-freedom Ponzi scheme, stealing degrees of freedom from hard-working blue-collar statisticians for my own profit. Is my methodology deeply flawed or defensible? If this approach would work, how would I have to adjust my error bands to account for the separate stages of nested regressions? If this approach is laughably unacceptable, can you think of a replacement method that would be reasonable?

Thank you so much for any help you can provide; I'm feeling a bit befuddled at the moment.

June 28, 2012 | Unregistered Commenter Sasha