Beta(alpha, beta) can be understood as a distribution over probabilities, e.g. over the probability of success in a Bernoulli trial. It represents all the possible values of a probability when we don’t know what that probability is. It’s particularly useful for binomial outcomes and for setting priors on binomial probabilities. That’s the intuition I’m going to flesh out here.

- **alpha – 1** is a number of successes.
- **beta – 1** is a number of fails.
- **Beta(alpha, beta)** is a continuous pdf which quantifies our belief in the probability of successes relative to fails.
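
To make the "successes and fails" reading concrete, here’s a minimal sketch of the Beta density using only Python’s standard library. The function name `beta_pdf` is my own choice for illustration. The exponents are exactly alpha – 1 and beta – 1, which is why I read them as counts.

```python
import math

def beta_pdf(p, alpha, beta):
    """Density of Beta(alpha, beta) at p, for 0 < p < 1."""
    # Log of the normalising constant 1 / B(alpha, beta).
    log_norm = math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)
    # p is raised to (alpha - 1) "successes" and (1 - p) to (beta - 1) "fails".
    log_kernel = (alpha - 1) * math.log(p) + (beta - 1) * math.log(1 - p)
    return math.exp(log_norm + log_kernel)

# Evaluate a few densities on a coarse grid of candidate probabilities.
for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(p, round(beta_pdf(p, 2, 1), 3), round(beta_pdf(p, 2, 2), 3))
```

The sketch works in log space so that large alpha and beta don’t overflow. It deliberately skips the endpoints p=0 and p=1, where the logarithm is undefined.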

# One is zero

As you can see, I think of alpha and beta values of 1 as a kind of 0 (zero information). So it’s really **Beta(alpha + 1, beta + 1)**, where alpha and beta now stand for the number of successes and fails I’ve observed in all my life, or the number of successes and fails I believe the universe has produced before I quantify my belief in the underlying probability. This latter belief is as strong as if I had actually observed these outcomes.
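
The bookkeeping above can be sketched in a few lines of Python. The helper names `beta_from_counts` and `beta_mode` are hypothetical, chosen for illustration. Adding counts to the parameters is the standard conjugate update for Bernoulli data, and starting from Beta(1, 1) is my "zero information" assumption.

```python
def beta_from_counts(successes, fails, prior_alpha=1.0, prior_beta=1.0):
    """Return (alpha, beta) after adding observed counts to a Beta prior."""
    # Each success adds 1 to alpha, each fail adds 1 to beta.
    return prior_alpha + successes, prior_beta + fails

def beta_mode(alpha, beta):
    """Most probable value of p; a single interior peak needs alpha > 1 and beta > 1."""
    return (alpha - 1) / (alpha + beta - 2)

alpha, beta = beta_from_counts(successes=6, fails=2)
print(alpha, beta)             # 7.0 3.0
print(beta_mode(alpha, beta))  # 0.75, i.e. 6 successes out of 8 trials
```

Starting from Beta(1, 1), the mode is simply the observed proportion of successes, which is the sense in which one is zero.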

Notice that if I actually know the underlying probability, then there’s no uncertainty to describe and thus no need for the Beta distribution to represent it.

See this stackoverflow Q+A for an excellent intuitive description of the beta distribution.

# Examples

- **Beta(1, 1)** is a flat, uninformative prior, meaning that **you don’t know anything at all** about the value of the probability determining the outcomes. This corresponds to having observed nothing, not even other probabilities which you could use as background information that probabilities in the real world are seldom 0 or 1. Beta(1, 1) means that any value, whether 0, 1 or 0.4532, is equally likely to you.
- **Beta(2, 1)** corresponds to having observed one Bernoulli trial (e.g. a coin toss) and it was a success. Likewise, Beta(1, 2) is one fail. Again, this single trial is all you have ever experienced in the whole history of the universe and now you want to quantify your belief in the underlying probability that caused this trial. The density of Beta(2, 1) is a straight line which is 0 at p=0, 1 at p=0.5 and 2 at p=1. This makes sense since we know for sure that the probability of success is not zero: we just observed a success and that is incompatible with p=0! Also, we think that it’s twice as likely that this universe produces pure successes (p=1) as a balanced mixture of successes and fails (p=0.5). More generally, having observed this single outcome, we think that a certain proportion of successes (e.g. p=0.6) is twice as likely as half that proportion (e.g. p=0.3).
- **Beta(2, 2)** (one success and one fail) is zero at p=0 and p=1 since we now know that this universe can produce both successes and fails. It’s maximal at p=0.5, meaning that our best guess is that if we continued collecting trials for all eternity, the outcome would on average resemble the two trials we’ve seen so far, though it’s certainly compatible with other values of p.
- **Beta(7, 3)** is having observed five more successes and one more fail since the Beta(2, 2). It’s heavy towards 1 with a maximum at p=0.75 since 6 out of 8 observed trials were successes.
- **Beta(40, 10)** is narrower since the large amount of data makes it increasingly improbable that we’re in a universe with a probability of success much different from about p=0.8 (the mean is 40/50 = 0.8 and the maximum is at 39/48 ≈ 0.81). For example, there’s only about a 0.0005 % chance that the underlying probability is below p=0.5.
- **Beta(2.5, 1.1)** corresponds to a vague belief that I might once have observed a fail and a vague belief that I’ve observed between one and two successes.
- **Beta(0, 0)** is not possible, consistent with the Beta(alpha + 1, beta + 1) intuition: it would correspond to having observed minus one success and minus one fail.
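
The Beta(40, 10) numbers can be checked with a short sketch that needs only the standard library (Python 3.8+ for `math.comb`). The function name `beta_cdf_int` is my own. It relies on the identity that, for whole-number alpha and beta, the Beta CDF equals a binomial tail sum, so it is not a general-purpose CDF.

```python
from math import comb

def beta_cdf_int(x, alpha, beta):
    """P(p <= x) under Beta(alpha, beta); valid only for positive integer alpha and beta."""
    n = alpha + beta - 1
    # Probability of at least `alpha` successes in n Bernoulli(x) trials.
    return sum(comb(n, j) * x**j * (1 - x)**(n - j) for j in range(alpha, n + 1))

alpha, beta = 40, 10
mean = alpha / (alpha + beta)            # 0.8
mode = (alpha - 1) / (alpha + beta - 2)  # 39/48 = 0.8125
tail = beta_cdf_int(0.5, alpha, beta)    # chance that p is below 0.5

print(mean, mode)
print(f"{100 * tail:.4f} %")  # about 0.0005 %
```

For non-integer parameters such as Beta(2.5, 1.1), this shortcut doesn’t apply and you’d need a numerical routine for the regularized incomplete beta function instead.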