one thing i find mysterious is that this does seem to be the case even though the models are fundamentally next-token predictors. in order for adding an explanation to improve the “after” variants, the model must be doing long-range planning internally. but when that long-range planning is made explicit, as in the “before” variants, quality degrades, perhaps because that is indeed a more unnatural and out-of-distribution kind of text to predict.
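
to make the conditioning asymmetry concrete, here is a minimal sketch of one way you could probe it, assuming a Hugging Face causal LM (gpt2 purely as a stand-in) and made-up prompt/answer/explanation strings; the `mean_logprob` helper is my own illustration, not the setup from the experiment above. the point is just that in the “after” ordering the scored tokens cannot condition on the explanation at all, so any improvement has to come from the model planning toward it internally.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def mean_logprob(prefix: str, target: str) -> float:
    """Mean per-token log-probability the model assigns to `target`
    when it is conditioned only on `prefix`."""
    prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids
    target_ids = tokenizer(target, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, target_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # logits at position i predict token i+1, so drop the last position
    # and shift: the target tokens are predicted starting at prefix_len - 1.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    start = prefix_ids.shape[1] - 1
    target_log_probs = log_probs[:, start:].gather(
        2, target_ids.unsqueeze(-1)
    ).squeeze(-1)
    return target_log_probs.mean().item()

# hypothetical task, answer, and explanation strings
prompt = "Write a Python function that adds two numbers.\n"
answer = "def add(a, b):\n    return a + b\n"
explanation = "Explanation: the function returns the sum of its two arguments.\n"

# "before" variant: the explanation precedes the answer, so the answer
# tokens are explicitly conditioned on it.
lp_before = mean_logprob(prompt + explanation, answer)

# "after" variant: the answer comes first, so its tokens cannot see the
# explanation; any benefit from writing one afterwards must come from
# the model implicitly planning toward it while producing the answer.
lp_after = mean_logprob(prompt, answer)

print(f"mean log-prob of answer, explanation before: {lp_before:.3f}")
print(f"mean log-prob of answer, explanation after:  {lp_after:.3f}")
```

note this only measures conditioning on a fixed answer, not generation quality, so it illustrates the before/after asymmetry rather than reproducing the comparison itself.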