This article applies economic analysis to econometric method. There are three conventional views in econometrics I would like to take issue with, based on economic reasoning:
- Convenience data is bad
- Low n is bad
- Omitted variable bias is the only thing
I would like to argue against all three of these based on the decreasing marginal return to each factor, and the outsized gain from moving from 0 to 1 units of each.
I agree that convenience data is bad in the sense that its conclusions will be less certain and its results less valuable than the conclusions of another trial run in a similar way with superior data. However, superior data is more expensive to obtain. There is no such thing as perfect data, although more money will often buy you better data.
However, there is a decreasing marginal return to the quality of data. A less certain and less valuable conclusion is sometimes preferable, because benefit should outweigh cost, and convenience data can be disproportionately cheaper than the alternative. Sometimes it is free.
A bad experiment is one which yields no information. Using a survey is a fine substitute for an elaborate observational study if it can produce a result, and if it doesn’t produce a result there is still no room to criticize until the counterfactual is demonstrated to be an effective alternative.
Sample size is one aspect of data quality, so the same point applies, but here it is perhaps a bit more clearly true. Consider a political survey run in four trials in similar fashion. One trial has n = 1, another n = 10, another n = 100, and another n = 500. I would make two points other researchers generally discount:
- Even the trial with n = 1 or n = 10 may yield a better signal than n = 0.
- n = 500 in the real world has very little statistical advantage over n = 100, but it may have a substantially higher cost. Sometimes the cost outweighs the benefit.
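The diminishing return to sample size can be made concrete with a rough sketch. The function below is my own illustration, not part of any survey discussed here. It assumes a simple random sample, a true proportion of 0.5, and the normal approximation for a 95% interval, which is unreliable at very small n (so n = 1 is left out).

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a sample proportion.

    Assumes simple random sampling and the normal approximation;
    p = 0.5 is the worst case and is an assumption of this sketch.
    """
    return z * math.sqrt(p * (1 - p) / n)

# Margin of error shrinks with the square root of n, so each extra
# respondent buys less precision than the one before.
for n in (10, 100, 500):
    print(f"n = {n:>3}: +/- {margin_of_error(n):.3f}")
```

Under these assumptions the first 100 respondents remove far more uncertainty than the next 400 do, which is the cost-benefit comparison the researcher actually faces.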
Bottom line: If we want to maximize research productivity and output, it is imperative that researchers often use data which is not the highest quality available.
Last, omitted variable bias is not the only thing: Excessive variable bias is a thing. If you have two instruments measuring the same concept then the result may appear falsely insignificant. Moreover, the explanatory power of the model has diminishing returns relative to the complexity, construction costs, and use costs of that model.
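One way to see why two measures of the same concept can make a result look falsely insignificant is the variance inflation factor. As a minimal sketch, assume an ordinary least squares regression with exactly two regressors whose correlation is r. In that case the sampling variance of each coefficient is multiplied by 1 / (1 - r^2) relative to the uncorrelated case. The function name and the example correlations are hypothetical.

```python
def variance_inflation(r):
    """Variance inflation factor for one of two regressors with correlation r.

    Holds for OLS with two regressors; r is the correlation between them.
    """
    return 1.0 / (1.0 - r ** 2)

# As the two regressors become near-duplicates, standard errors grow
# and a real effect becomes harder to distinguish from zero.
for r in (0.0, 0.5, 0.9, 0.99):
    vif = variance_inflation(r)
    print(f"r = {r}: variance x{vif:.1f}, standard error x{vif ** 0.5:.1f}")
```

The coefficient estimates are not biased by this, but their standard errors grow, so each variable can fail a significance test even when the concept they jointly measure matters.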
I would also add that researchers’ professional obsession with omitted variable bias (OVB) itself causes, I think, a bias toward having too many variables.
I have seen an experiment where a researcher wanted to know the effect of income on calorie intake and they corrected for consumption of soda. They literally corrected for one of the things they were trying to explain. That’s ridiculous.
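A small simulation shows what goes wrong in a case like this. It is a sketch with made-up numbers, not the study I saw: I assume income raises soda consumption, and that both income and soda raise calorie intake, so soda is one of the channels through which income acts. All coefficients, the sample size, and the helper names are hypothetical.

```python
import random

def cov(a, b):
    """Sample covariance of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

def slope_simple(y, x):
    """OLS slope of y on x alone (with intercept)."""
    return cov(x, y) / cov(x, x)

def slope_controlled(y, x, z):
    """OLS coefficient on x when y is regressed on x and z (with intercept)."""
    det = cov(x, x) * cov(z, z) - cov(x, z) ** 2
    return (cov(z, z) * cov(x, y) - cov(x, z) * cov(z, y)) / det

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 5000

# Hypothetical data-generating process (all coefficients are assumptions).
income = [rng.gauss(0, 1) for _ in range(n)]
soda = [0.5 * i + rng.gauss(0, 1) for i in income]
calories = [1.0 * i + 2.0 * s + rng.gauss(0, 1) for i, s in zip(income, soda)]

total = slope_simple(calories, income)             # near 1.0 + 0.5 * 2.0 = 2.0
direct = slope_controlled(calories, income, soda)  # near 1.0

print(f"income only:        {total:.2f}")
print(f"income, given soda: {direct:.2f}")
```

In this setup the single-variable regression recovers the full effect of income on calories, while the regression that corrects for soda reports only the part that does not run through soda, which understates the answer to the question being asked.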
So I think all reasonable people can quickly agree, from the marginal cost versus marginal benefit argument, that there is at least a Goldilocks number of variables, but I would go a bit further than that.
I think there may be some Goldilocks model which we might call the rigorous model, but I think another useful framework is the single-variable model. This model does what models should do: It simplifies. Single-variable models allow us to be sure there is no bias in variable selection. As I said years ago, “Even if we correct for everything we can think of this is still a kind of selection bias.”
So while researchers may denigrate single-variable models as the convenience data of models, I would encourage them to view such models in another light: They are the unbiased, gold standard, random sampling of the model world, and they often yield an immense return relative to their complexity, construction, and use costs.