I recently wrote about a systematic, bias-minimizing approach to exploratory data analysis and model identification. This article clarifies my preferred process and adds three exceptions to the typical process.
I begin by collecting data of interest. The data sets aren’t randomly assembled, but neither are they assembled with a particular operationalization in mind. I am happy to measure the same concept multiple ways and leverage the most statistically effective measure in my model. This is the horse racing concept previously discussed.
More formally, I create models in 4 stages:
- The kitchen sink regression maximizes total explanatory power, R².
- This is a useful step in an EDA as it establishes an upper bound on explanatory power.
- This regression will generally include many insignificant factors, and so is not preferred. In application, this would indicate spending money to obtain irrelevant information.
- The weak model includes all factors with p < 0.5.
- At this threshold, the probability of error is lower than the probability of no error. That is, each of these factors is individually more likely than not to reflect a real relationship.
- With as few as two factors at p = 0.4, however, it is likely that at least one factor is erroneously identified as meaningful.
- This model is considered inefficient from a complexity perspective, but it may still be ideal from an economic or business perspective. Compare the return on investment against the marginal cost of collecting each additional factor in your particular situation to identify the cost-preferred model.
- As an exploratory research tool, this model is great! It highlights factors that could use additional inspection. This model is basically a record of research opportunities.
- The medium model maximizes adjusted explanatory power.
- As a rule of thumb, factors in these models generally have p < 0.35.
- Adjusted explanatory power is an arbitrary but standard complexity measure. Under certain assumptions, this model has the optimal amount of explanatory power per unit of complexity. This makes it nice for several reasons, but it does not mean that it is the most useful model from an economic or business perspective.
- The strong model includes only factors that are individually significant.
- p-value analysis is quite arbitrary, but it is a way to quickly filter down to the most important factors.
- Models of this sort are generally relatively cheap to implement.
- The factors in this model provide the largest explanatory power with the least complexity.
- You will need to check to make sure the total amount of explanatory power is enough to constitute a useful model.
- Again, look at the return on investment of implementing a model.
- If you are screening applicants and a statistical tool can explain 1 percent of bad applications, it may or may not make sense to use the tool, depending on how costly the tool is to create and use on an ongoing basis.
- An example of an ongoing cost might be that applicants need to include more information in their application to support the model, and this might deter application or make application processing more complicated.
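
The arithmetic behind the weak and medium models above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it takes the loose reading used above, where a factor's p-value is treated as that factor's own chance of being spurious, and it assumes the factors are independent. The function names `prob_any_spurious` and `adjusted_r2` and the sample numbers (50 observations, R² moving from 0.500 to 0.505) are hypothetical and chosen for illustration.

```python
from math import prod


def prob_any_spurious(p_values):
    """Chance that at least one factor is spurious.

    Assumption: each p-value is read as that factor's own error
    probability, and the factors are treated as independent.
    """
    return 1.0 - prod(1.0 - p for p in p_values)


def adjusted_r2(r2, n_obs, n_factors):
    """Standard adjusted R^2 for a model with an intercept."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_factors - 1)


# Two factors at p = 0.4, as in the weak-model caveat: 1 - 0.6 * 0.6.
weak_model_risk = prob_any_spurious([0.4, 0.4])

# Hypothetical numbers: a sixth factor lifts R^2 only slightly,
# so the complexity penalty outweighs the gain.
before = adjusted_r2(0.500, n_obs=50, n_factors=5)
after = adjusted_r2(0.505, n_obs=50, n_factors=6)

print(f"chance at least one of two p = 0.4 factors is spurious: {weak_model_risk:.2f}")
print(f"adjusted R^2 with 5 factors: {before:.4f}, with 6 factors: {after:.4f}")
```

In the second comparison, total R² rises while adjusted R² falls. That is the situation in which the medium model drops a factor the kitchen sink regression would keep.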
So those are my 4 stages. We know that I like to start with a kitchen sink and eliminate factor-by-factor based on significance, but here are three exceptions:
- Sometimes eliminating a variable can cause adjusted r-squared to go down and at the same time cause another variable’s p-value to spike. If it spikes over 0.25, test removing it and see if the adjusted r-squared goes back up. You may have encountered a local, rather than global, maximum of adjusted r-squared.
- If multiple variables have the same p-value then don’t just drop them all. Compare the models generated by dropping each of them one at a time, and continue the reduction from the best of those models.
- Sometimes a “simplicity transform” can generate improved adjusted r-squared. In other cases adjusted r-squared may remain the same or decrease to an unimportant degree, but the transformed model may still be preferred.
- A simplicity transform involves replacing a variable with a reduced-power variant. Most commonly, I see that the form Y = X^2 + X^3 can be transformed to Y = X + X^2 without important negative impact, or even with a gain in model power.
- Simplicity transforms may be preferred even with slight technical power reduction due to interpretability. As model users, we want to understand what our model is doing. Direct effects and even marginal effects are logical, intuitive, and have rich theory to back them. Cubic effects are hard to interpret and often have very little theory to justify them.
- Cubic effects can be seen as “marginal effects of marginal effects”, which is to say that they may cause a quadratic effect’s sign to flip over some range. However, it’s never immediately obvious whether the model-applicable range includes the sign-flipping threshold, and if your model doesn’t have an established quadratic effect to begin with, this interpretation is nearly useless. Moreover, the mechanism for a “marginal effect of marginal effects” is rarely clear. One exception would be in physics, where jerk is mechanically well-established.
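
To make the simplicity transform concrete, here is a minimal sketch that fits both forms by ordinary least squares and compares their adjusted r-squared. It uses only the Python standard library. The helper names (`solve`, `fit_adjusted_r2`) and the synthetic data are my own assumptions for illustration: the data is generated from a linear-plus-quadratic relationship, so the outcome favors the simpler form by construction. With real data, either form could win.

```python
import random


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_adjusted_r2(columns, y):
    """Regress y on an intercept plus the given columns; return adjusted R^2."""
    n, k = len(y), len(columns)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k + 1)] for a in range(k + 1)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k + 1)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * rj for bj, rj in zip(beta, r)) for r in rows]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - (ss_res / ss_tot) * (n - 1) / (n - k - 1)


# Synthetic data (an assumption for illustration): Y depends on X and X^2 plus noise.
rng = random.Random(0)
n = 200
x = [5.0 * i / (n - 1) for i in range(n)]
y = [2.0 + 1.5 * xi - 0.3 * xi ** 2 + rng.gauss(0.0, 0.1) for xi in x]

x1 = x
x2 = [xi ** 2 for xi in x]
x3 = [xi ** 3 for xi in x]

cubic_form = fit_adjusted_r2([x2, x3], y)   # Y = X^2 + X^3
simple_form = fit_adjusted_r2([x1, x2], y)  # Y = X + X^2

print(f"adjusted R^2, X^2 + X^3: {cubic_form:.4f}")
print(f"adjusted R^2, X + X^2:   {simple_form:.4f}")
```

Both models use two factors, so the comparison is on equal complexity. If the simpler form comes out only slightly lower on your own data, the interpretability argument above may still make it the better choice.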