I was reading this interesting preprint by Paul Bürkner and Matt Vuorre on ordinal regression using brms (which I recently praised in a tutorial on computing Bayes factors). Now see this footnote about their vocabulary:
Hallelujah and Eureka!! I think that these terms may help solve (some of) the long-standing confusion about the difference between “fixed” and “random” effects.
The mathematical distinction is that Varying (previously "random") parameters have an associated variance while Population-level (previously "fixed") parameters do not. The major practical implication is that estimates of Varying parameters are shrunk towards their mean, in a regression-towards-the-mean-like way, whereas Population-level parameters are not.
However, understanding which of these to use to model real-world phenomena is not self-evident from this mathematical distinction. So let me try to unpack why I think that the terms “Population-Level” and “Varying” convey this understanding pretty well.
Values of Population-level parameters are modeled as identical for all units.
Everybody who could ever be subjected to the treatment X would get an underlying improvement of exactly 4.2 points more than had they been in the control group. Repeat: Every. Single. Member. …of this population of treated individuals!
The example above could be a 2×2 RM-ANOVA model of RCT data (outcome ~ treatment * time + (1|id)) with treatment-specific improvement (the treatment:time interaction) as the population-level parameter of interest. Populations could be all people in the history of the universe, all stocks in the history of the German stock market, etc. Again, the estimated parameters are modeled as if they were exactly the same for everyone. The only thing separating you from seeing that omnipresent value is a single residual term in the model, reflecting unaccounted-for noise. I think that modeling anything as a Population-level parameter is an incredibly bold generalization of the sort that we hope to discover using the scientific method: the simplest model with the greatest predictive accuracy.
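To make the "exactly 4.2 for everyone" idea concrete, here is a minimal base-R sketch with simulated data (all numbers invented for illustration): every treated subject's underlying improvement is exactly 4.2 points, and only residual noise hides it. A plain lm() recovers the population-level interaction:

```r
# Simulate a 2x2 RCT: 200 subjects per arm, two time points each.
set.seed(1)
n  <- 200
id <- factor(rep(1:(2 * n), each = 2))     # subject identifiers (two rows each)
treatment <- rep(c(0, 1), each = 2 * n)    # 0 = control, 1 = treated
time      <- rep(c(0, 1), times = 2 * n)   # 0 = baseline, 1 = follow-up

# Population-level truth: every treated subject improves by exactly 4.2 points.
outcome <- 50 + 2 * time + 4.2 * treatment * time + rnorm(4 * n, sd = 5)

fit <- lm(outcome ~ treatment * time)
coef(fit)["treatment:time"]                # close to the true 4.2
```

The fitted treatment:time coefficient is the single value the model asserts for every member of the population; individual departures from it all land in the residual term.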
Now, it’s easy to see why this would be called “fixed” when you have a good understanding of what it is, but as a newcomer, the term “fixed” may lead you astray into thinking that (1) it is not estimated, (2) it is fixed to something, or (3) it’s semantically self-contradictory to call a variable fixed! Andrew Gelman calls them non-varying parameters, and I think this term suffers a bit from the same possible confusions. Population-level goes a long way here. The only ambiguity left is whether parameters that apply to the population also apply to individuals, but I can’t think of a better term. “Universal”, “Global”, or “Omnipresent” are close competitors, but they seem to generalize beyond a specific population, so let’s stick with Population-level.
Values of Varying parameters are modeled as drawn from a distribution.
Example for (1|id):
Patient-specific baseline scores vary with SD = 14.7.
Example for (0 + treatment:time | id):
The patient-specific responses to the treatment effect vary with SD = 3.2 points.
This requires a bit of background explaining, so bear with me: most statistics assume that the residuals are independent. Independence is a fancy way of saying that knowing any one residual would not let you guess above chance about any other residual. Thus, the independence assumption is violated if you have multiple measurements from the same unit, e.g., multiple outcomes from each participant, since knowing one residual from an extraordinarily well-performing participant would lead you to predict above chance that other residuals from that participant would also be positive.
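This violation is easy to see in a small base-R simulation (the SD of 14.7 is borrowed from the example above; everything else is invented): each participant contributes two measurements that share a participant-specific baseline, so the residuals of a model that ignores participants are strongly correlated within person:

```r
# 300 participants, two measurements each, sharing a person-specific baseline.
set.seed(2)
n_id <- 300
baseline <- rnorm(n_id, sd = 14.7)          # participant-specific level
y1 <- 50 + baseline + rnorm(n_id, sd = 5)   # measurement 1
y2 <- 50 + baseline + rnorm(n_id, sd = 5)   # measurement 2

fit <- lm(c(y1, y2) ~ 1)                    # ignores who the data came from
res <- matrix(resid(fit), ncol = 2)         # one row of residuals per person
cor(res[, 1], res[, 2])                     # far above 0: independence violated
```

Knowing one of a participant's residuals clearly predicts the other, which is exactly what the independence assumption forbids.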
You could attempt to solve this by modeling a Population-level intercept for each participant (outcome ~ treatment * time + id), effectively subtracting that source of dependence from the model’s overall residual. However, which of these participant-specific means would you apply to an out-of-sample participant? Answer: none of them; you are stuck (or fixed?). Varying parameters to the rescue! Dropping the ambition to say that all units (people) exhibit the same effect, you could instead estimate the recipe for generating those participant-specific intercepts, which still rids the residuals of their dependence (or more precisely: models it as a covariance matrix). This is a generative model in the form of the parameter(s) of a distribution, and in a Gaussian GLM this would be the standard deviation of a normal distribution with mean zero. In R, this is typically written as outcome ~ treatment * time + (1|id).
One way to represent this clustering of variation to units is a hierarchical model where outcomes are sampled from individuals which are themselves sampled from the Population-level parameter structure:
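The generative recipe can be sketched in a few lines of base R (numbers invented for illustration): the population-level structure specifies the distribution (mean 0, SD 14.7) from which participant intercepts are drawn, and observations are then drawn around each participant's intercept:

```r
# Two-level generative model: participants drawn from the population-level
# distribution, observations drawn around each participant.
set.seed(3)
sd_id    <- 14.7    # SD of the varying intercepts (the "recipe")
sd_resid <- 5       # residual SD around each participant's level
n_id     <- 50      # participants, 4 observations each

intercepts <- rnorm(n_id, mean = 0, sd = sd_id)            # level 2: sample people
outcomes   <- rnorm(n_id * 4,
                    mean = 60 + rep(intercepts, each = 4), # level 1: sample data
                    sd = sd_resid)
sd(intercepts)      # close to the 14.7 we put in
</imports>
```

Fitting (1|id) is essentially running this recipe in reverse: estimating sd_id from the observed clustering instead of generating from it.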
For this reason, I think that we could also call Varying parameters sampled parameters. Anyway, there are two main implications of modeling as Varying rather than Population-level:
- Firstly, it models regression towards the mean for the varying parameters: extreme units are shrunk towards the mean of the varying parameter, since extreme observed values are unlikely to reflect equally extreme underlying values. For example, if you observe an exceptionally large treatment effect for a particular participant, that participant likely experienced a smaller underlying improvement that unaccounted-for factors exaggerated by chance. Similarly, an exceptionally small observed treatment effect likely reflects a larger underlying effect masked by chance noise.
- Secondly, it requires enough levels (“samples”) of the Varying factor to estimate its variance. You just can’t make a very meaningful estimate of variance using two or three levels (e.g., ethnicity). Similarly, sex would make no sense as varying since there are basically just two levels. Participant, institution, or vendor would be good candidates in analyses with many levels of each. For frequentist models like lme4::lmer, a rule of thumb is at least 5–6 levels.
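The shrinkage implication above can be computed by hand in a small base-R sketch. Assuming (for illustration; all numbers invented) that the residual SD and the between-unit SD are known, the partially pooled estimate of a unit's mean is a precision-weighted average of its own observed mean and the grand mean:

```r
# Closed-form partial pooling with known variances (a textbook simplification).
grand_mean <- 0
unit_means <- c(-12, -1, 0.5, 2, 15)   # observed unit means; two are extreme
n_obs      <- 5                        # observations per unit
sd_resid   <- 10                       # within-unit SD (assumed known)
sd_unit    <- 4                        # between-unit SD (assumed known)

# Weight on the unit's own data: its precision relative to total precision.
w <- (n_obs / sd_resid^2) / (n_obs / sd_resid^2 + 1 / sd_unit^2)
shrunk <- grand_mean + w * (unit_means - grand_mean)
round(shrunk, 1)   # every unit pulled toward 0; extreme units move furthest
```

Every estimate moves towards the grand mean, and the extreme units (-12 and 15) move the furthest in absolute terms, which is precisely the regression towards the mean described above.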
Again, it’s easy to see why one would call this a “random” effect. However, as with “fixed effects,” it is easy to confuse this with (1) the random-residual term of the whole model, or (2) the totally unrelated difference between frequentist and Bayesian inference as to whether data or parameters are fixed or random. Varying seems to capture what it’s all about – that units can vary in a way that we can model. With variation comes regression towards the mean, so it follows naturally.
A few sources that helped me arrive at this understanding were:
- This answer on Cross-Validated which made me realize that shrinkage is the only practical difference between modeling parameters as “fixed” or “random.”
- The explanation in the FAQ to R mailing list on GLMM, primarily written by the developers of the lme4 package.
- This paper [INSERT LINK] with recommendations for mixed models.