In a world without probability there would be only pure logic. In a pure-logic world you could say something like “this coin will come up heads on the next toss” but never “this coin is equally likely to come up heads or tails”. This is because statements that don’t evaluate to true/false (such as “equally likely”) cannot be subjected to logical analysis.
Probability to the rescue! According to Wikipedia, probability is a measure of the likeliness that an event will occur. Using probability, we can now analyze likely, unlikely, equally likely etc. events in addition to mere truth or falsehood.
Probability of what?
We use probability to get in touch with reality. The frequentist school and the Bayesian school do this in different ways, and I find it illuminating to understand the Bayesian approach by contrasting it with the frequentist one.
Frequentists get in touch with reality through the long-run frequency with which samples would be representative of reality. In the frequentist approach, probability is the frequency with which (future) repetitions of the data collection would yield the event in question, e.g. 95% of the time. Frequentists do not directly make inferences about the true state of the universe.
Bayesians get in touch with reality by calculating the probability that each of a range of possible universes is actually the universe we inhabit, given a sample. In the Bayesian approach, probability is the degree of belief that the universe is in a particular state that generated the event, e.g. that a parameter has the value 42. Bayesians do not directly make inferences about future data, although a good model of the universe is really useful if you want to make predictions!
For excellent and fun examples of how frequentists and Bayesians would debate, see this StackOverflow thread.
What’s fixed and what’s probabilistic?
In the frequentist school, we learn about P(D | M), which reads “the probability of the data (probabilistic) given the model (fixed)”. In other words, the parameters inform us about our observations. For example, the p-value is the long-run probability of observing data at least as extreme as ours under the null hypothesis (parameter = 0). The world is regarded as fixed (Maximum Likelihood Estimation gives a single value for each parameter) and the data are regarded as probabilistic.
A frequentist thinks: “We know what our hypotheses about the universe are (they’re fixed). These hypotheses predict data with some randomness, and we just sampled some of that data. If the data are perfectly representative of the universe (the MLE estimate), how likely is it that future samples would be congruent with that?”
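To make the frequentist reasoning concrete, here is a minimal sketch (my own toy example, not from the text above): computing a p-value for a coin that came up heads 8 times in 10 tosses, under the null hypothesis of a fair coin. The p-value is the long-run frequency with which repetitions of the experiment would produce data at least as extreme as ours, assuming the null is true.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k heads in n tosses of a coin with bias p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def two_sided_p_value(heads: int, n: int, p_null: float = 0.5) -> float:
    """Sum the probabilities of all outcomes at least as extreme
    (i.e. at most as probable under the null) as the observed one."""
    observed = binomial_pmf(heads, n, p_null)
    return sum(
        binomial_pmf(k, n, p_null)
        for k in range(n + 1)
        if binomial_pmf(k, n, p_null) <= observed
    )

p = two_sided_p_value(8, 10)
print(round(p, 4))  # 0.1094: ~11% of hypothetical replications are this extreme
```

Note how the inference is entirely about hypothetical future data sets, not about the coin’s true bias: the null parameter stays fixed throughout.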
The Bayesian school learns about P(M | D), which reads “the probability of the model (probabilistic) given the data (fixed)”. In other words, the observations inform us about the parameters. It is exactly opposite to the frequentist approach in this respect.
A Bayesian thinks: “We collected these data, no doubt about it (observations are fixed). These data could have come about in a lot of different universes (probabilistic), and we’re going to infer how likely it is that each of these universes is actually our universe, given our observations.”
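The Bayesian reasoning can be sketched the same way (again a toy example of my own, not from the text): take the fixed observation of 8 heads in 10 tosses, enumerate candidate universes (values of the coin’s bias), and use Bayes’ rule to compute how probable each universe is given the data.

```python
from math import comb

def likelihood(bias: float, heads: int, n: int) -> float:
    """P(data | universe): probability of the observed data under one candidate bias."""
    return comb(n, heads) * bias**heads * (1 - bias)**(n - heads)

# Candidate universes: bias = 0.0, 0.1, ..., 1.0, with a flat prior.
universes = [i / 10 for i in range(11)]
prior = {b: 1 / len(universes) for b in universes}

# Bayes' rule: posterior is proportional to likelihood * prior,
# normalized so the mutually exclusive universes sum to 1.
unnormalized = {b: likelihood(b, 8, 10) * prior[b] for b in universes}
total = sum(unnormalized.values())
posterior = {b: w / total for b, w in unnormalized.items()}

best = max(posterior, key=posterior.get)
print(best)  # 0.8 is the most probable universe given the data
print(round(sum(posterior.values()), 10))  # posterior probabilities sum to 1.0
```

Here the data never vary; what gets a probability is each possible state of the universe, which is exactly the reversal described above.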
Common misunderstanding 1: parameters aren’t ontologically random.
I often hear people say “parameters are random in Bayesian statistics”. If you think back to the basic axioms of probability theory (the Kolmogorov axioms), they require that universes (variables taking on particular values) are mutually exclusive and that we must always be in exactly one universe, i.e. have only one value for each parameter. So the ontology here is that parameters take on exactly one value; we just don’t know which one. We can, however, evaluate the probability of each single value being the actual value.
This is what our prior and posterior distributions are about. So, to be clear, (joint) distributions assign probabilities to possible universes, one of which is the actual one that we’re in. And since we cannot completely discard these universes, we’re going to entertain all of them in our analysis and conclusions. But we only entertain them exactly to the degree that they are probable.
Common misunderstanding 2: “subjective belief” isn’t vague
Bayesian statistics quantifies belief given data and prior knowledge. The use of the terms “belief” and “prior” has been called “subjective”, and this gives some people a non-scientific impression of Bayesian statistics. However, this belief state is anything but vague: it is entirely numeric and derived from mathematical axioms.
The critique of priors and belief as unscientific comes from frequentists. But the frequentist interpretation of probability could easily be argued to require an even greater degree of belief, to uphold assumptions that are not readily justified by the data. In particular, frequentist statistics usually assumes that observations are exactly independent and identically distributed, often with normal distributions. In addition, to quote Jaynes (2003), the frequentist interpretation is about “nonexistent data sets” (outcomes from replications that will likely never be carried out) and “unobservable limiting frequencies” (even if we did replicate, we cannot replicate forever because of the ultimate fate of the universe).