Contra Sample Splitting

Marek Kirejczyk discussed a negative trend in software development called Hype Driven Development. I’m here to argue the same thing happens in data, econometrics, and academia.

I’ll give two examples: the p-value and sample splitting. My real focus here is to convince the reader that sample splitting is a trendy trick that is in fact bad for analysis.

The .05 alpha level as the standard cutoff for a successful p-value test was never logically derived; it was always a professional norm. Essentially, it was all hype. Now the profession is finally starting to change.

Now there’s another thing called sample splitting. I’ve seen it used in a few contexts:

  1. Threshold Estimation
  2. Interaction Modelling
  3. Robustness Testing
    1. A more involved discussion with Susan Athey of Stanford University here.

At this time I’m only prepared to criticize #3. The idea here is that if you have a large sample you can generate two random subsamples, derive a model on one, and forward-test it on the other. This is an exceedingly weak robustness test, and in fact the process will cause you to lose particular patterns. Consider the following:

  1. Sample splitting is not real forward-testing or meta-analysis. The following occur in real cross-sample analysis, but not in sample splitting:
    1. Reduced exposure to measurement bias.
    2. Reduced exposure to cherry-picking specifications.
    3. Implicit robustness checks due to various unexpected and/or unobserved differences.
  2. Sample splitting may cause certain independent variables to appear in one sample and not the other.
  3. Sample splitting may cause certain cross-relations between independent variables to exist in one sample and not the other.
    1. Even if both variables are in both places, the significance of a cross-correlation may be lost.
    2. This may be due to the absence of variance-reducing covariates in one or another subsample.
  4. Sample splitting may cause an analyst to miss the significance of variables which are significant but have a high variance.
    1. Using the full/pooled/aggregated sample will allow the analyst to identify the coefficient more precisely.
    2. If you are really stuck on sample splitting, consider pooling afterward and adopting the pooled coefficient, especially when the confidence intervals on either sample are consistent.
  5. Since you’re not doing real cross-sample analysis as per (1), you’re really just checking for variable significance with a smaller n.
    1. This is cool, but you don’t need to engage the whole sample splitting process to do that.
  6. Why 2 subsamples? A real test would leverage, at best, f(n) subsamples and check for robustness at each level.
    1. f(n) is like this: Start with one sample, then two, then three, and so on until each subsample has 2 observations. Then you’re done, with #subsamples = n/2.
    2. The point of this is to check for cross-sample durability of variable significance. This is supposedly the lauded quality of sample splitting.
    3. But, the same comprehensive durability check is implicit in the ordinary p-value of the specification in the aggregated sample.
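
Here is a minimal sketch of points 4.1 and 4.2, under my own assumptions: simulated data, a single regressor, a small true slope of 0.1 against noisy outcomes, and a hand-rolled helper I’m calling `ols_slope` (an illustrative name, not from any library). It fits the same regression on the full sample and on two random halves, then pools the two half-sample coefficients by inverse-variance weighting, which is one common way to pool and not necessarily the only one.

```python
import math
import random


def ols_slope(xs, ys):
    """Slope and its standard error from a simple OLS fit of y on x (with intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    se = math.sqrt(rss / (n - 2) / sxx)  # residual variance divided by Sxx
    return slope, se


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 2000                # assumed sample size
xs = [rng.gauss(0, 1) for _ in range(n)]
ys = [0.1 * x + rng.gauss(0, 2) for x in xs]  # assumed small effect, high noise

# Randomly split the observations into two halves.
idx = list(range(n))
rng.shuffle(idx)
half_a, half_b = idx[: n // 2], idx[n // 2:]

full_b, full_se = ols_slope(xs, ys)
a_b, a_se = ols_slope([xs[i] for i in half_a], [ys[i] for i in half_a])
b_b, b_se = ols_slope([xs[i] for i in half_b], [ys[i] for i in half_b])

# Inverse-variance pooling of the two half-sample coefficients.
w_a, w_b = 1 / a_se ** 2, 1 / b_se ** 2
pooled_b = (w_a * a_b + w_b * b_b) / (w_a + w_b)
pooled_se = math.sqrt(1 / (w_a + w_b))

rows = [("full", full_b, full_se), ("half A", a_b, a_se),
        ("half B", b_b, b_se), ("pooled", pooled_b, pooled_se)]
for label, b, se in rows:
    print(f"{label:>7}: slope={b:.3f}  se={se:.3f}  t={b / se:.2f}")
```

Each half has half the observations, so its standard error should come out roughly √2 times the full-sample one, and the pooled row should land close to the full-sample row. That is the sense in which splitting costs you precision that pooling merely hands back.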

It is really point 6.3 which makes my case most strongly. I am not asserting that sample splitting does not check for robustness. I am asserting that it checks for robustness in a way which is mathematically equivalent to not having split the sample, and as a result is an utter waste of time in practice.
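
For the curious, the f(n) sweep from point 6 could be sketched as follows. This is only an illustration under my own assumptions: simulated data with a true slope of 0.5, a hypothetical helper `significant_share`, a normal-approximation cutoff of 1.96 (which is loose for the smallest subsamples), and a sweep that stops at 10 observations per subsample rather than 2, since a regression with an intercept needs more than 2 points to produce a standard error. It does not prove the equivalence claim; it only shows what the exercise looks like.

```python
import math
import random


def slope_t_stat(xs, ys):
    """t-statistic of the slope in a simple OLS fit of y on x (with intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope / se


def significant_share(xs, ys, k, rng, t_crit=1.96):
    """Split the data into k random subsamples; return the share with |t| > t_crit."""
    idx = list(range(len(xs)))
    rng.shuffle(idx)
    hits = 0
    for j in range(k):
        part = idx[j::k]  # every k-th shuffled index: k near-equal random subsamples
        t = slope_t_stat([xs[i] for i in part], [ys[i] for i in part])
        if abs(t) > t_crit:
            hits += 1
    return hits / k


rng = random.Random(1)  # fixed seed so the sketch is repeatable
n = 1200                # assumed sample size
xs = [rng.gauss(0, 1) for _ in range(n)]
ys = [0.5 * x + rng.gauss(0, 2) for x in xs]  # assumed effect size and noise

ks = [1, 2, 4, 8, 20, 40, 120]  # subsample counts; the last leaves 10 obs each
shares = {k: significant_share(xs, ys, k, rng) for k in ks}
for k, share in shares.items():
    print(f"k={k:>3}  obs per subsample={n // k:>4}  share significant={share:.2f}")
```

The share of subsamples in which the variable stays significant will tend to fall as the subsamples shrink, and how fast it falls is governed by the same effect size, noise, and n that determine the t-statistic in the aggregated sample.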

Tangentially: Becker and Stigler discuss the existence of fashions and fads in scientific doctrine in the classic De Gustibus Non Est Disputandum.

