Random Number Generation in GameMaker

June 2022

Level 4: Other probability distributions

You're now an expert on uniform distributions and even understand how to generate a non-uniform discrete distribution, where the probabilities (or weights) of the different events are not equal. That gives you enormous flexibility, but there are certain phenomena that would be very complex, or even impossible, to generate using only those tools. That's why other specialized families of distributions have been created. There is a huge number of families, so in this write-up I'll go over five of the most useful ones and how to generate them; should you need to use any other distribution in the future, you'll have the understanding and tools to do it. But first, two important concepts: the mean and the variance.

Mean and variance

If you have a probability distribution (a model or formula), you can calculate statistics. A statistic is just an aggregate calculation. For example, the mean of the distribution is a (weighted) average of all possible outcomes, weighted by their corresponding probabilities. In a uniform distribution like we saw before, the weights are all equal and we can calculate the average like we always do; in non-uniform distributions, we need to use the probabilities as weights. In statistics, this is also called the expected value: it's what we would "expect" to get on average over a number of experiments. For example, for a continuous uniform distribution between 0 and 1, we "expect" the realizations to be around 0.5.

On the other hand, the standard deviation of a distribution is a measure of dispersion of a set of values around the mean; that is, how far away from the expected value the distribution can deviate. To calculate it we actually calculate the variance, which is the average of the squared differences of each value with respect to the mean, and then take its square root. A higher standard deviation (or variance) indicates a higher amount of dispersion, i.e., the distribution can vary wildly. In our continuous uniform example, the standard deviation can be shown to be $1/\sqrt{12} \approx 0.288$. The general formulas are the following [1]:

$$\mu = \sum_{k=1}^{N} X_{k} P[X=k] $$

$$\sigma^{2} = \sum_{k=1}^{N} (X_{k}-\mu)^{2} P[X=k] $$
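
As a quick check of that $1/\sqrt{12}$ value, we can apply the continuous (integral) version of the variance formula [1] to the uniform distribution on $(0,1)$, whose mean is $\mu = 1/2$:

$$\sigma^{2} = \int_{0}^{1} \left(x - \frac{1}{2}\right)^{2} dx = \left[\frac{1}{3}\left(x - \frac{1}{2}\right)^{3}\right]_{0}^{1} = \frac{1}{24} + \frac{1}{24} = \frac{1}{12}$$

Taking the square root gives $\sigma = 1/\sqrt{12} \approx 0.288$, as stated.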

The above calculations are done for the probability distribution or model; hence, they are called the population mean and variance. If we instead have an actual set of values from the distribution (for example, if we generated random numbers from that distribution using a PRNG), we can calculate the equivalent statistics on that sample. They are called the sample mean and sample variance (or sample standard deviation, equivalently). The sample mean is just the arithmetic average of the sample; for the sample variance, several formulas exist which have different properties, but one of the most used divides the sum of squared differences by $N-1$ instead of $N$ (this is called the corrected sample variance):

$$m = \frac{1}{N} \sum_{k=1}^{N} X_{k} $$

$$s^{2} = \frac{1}{N-1} \sum_{k=1}^{N} (X_{k}-m)^{2} $$

The idea is that the sample mean and variance can estimate the "true" mean and variance, so as you'll see later, we will be able to use these sample formulas to estimate some of the parameters of the distributions.
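
To make this concrete, here's a minimal GML sketch that computes both sample statistics from an array of observations (the function names sample_mean and sample_variance are mine, not built-ins):

function sample_mean(_data) {
	var _n = array_length(_data);
	var _sum = 0;
	for (var _i = 0; _i < _n; _i++) _sum += _data[_i];
	return _sum / _n;
}

function sample_variance(_data) {
	// corrected sample variance: divide by N-1, not N
	var _n = array_length(_data);
	var _m = sample_mean(_data);
	var _sum_sq = 0;
	for (var _i = 0; _i < _n; _i++) _sum_sq += sqr(_data[_i] - _m);
	return _sum_sq / (_n - 1);
}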

The Binomial distribution

The Binomial distribution is a discrete distribution that models the probability of getting a certain number of successes (or true or yes) within a number $n$ of independent experiments/trials with the same probability $p$. The usual example is getting $k$ heads in a series of $n$ coin flips. The distribution can be expressed as follows:

$$P[X = k] = \binom{n}{k} p^{k} (1-p)^{n-k}$$

Let's understand that formula. First, using the rule of product, out of the $n$ events, the probability of getting $k$ heads is $p^{k}$, and naturally the remaining $n-k$ flips will be tails, which has probability $(1-p)^{n-k}$. Multiplying those two gives the probability of one possible run. However, note that, for example, in a four-coin flip with three heads, $H H T H$ yields the same probability as $T H H H$ and $H H H T$, etc. (all of them have just one tail). How many different combinations are there? The answer is $\binom{n}{k}$, the number of ways to choose $k$ elements from a set of $n$ (if order is not important).

Let's use the Binomial distribution to find out the probability of getting $k$ critical hits on five independent attacks by our hero in our turn-based battler. Let's assume the probability of a single critical hit is 10%:

$k$ critical hits    Probability
0                    0.59049
1                    0.32805
2                    0.07290
3                    0.00810
4                    0.00045
5                    0.00001

So, on five independent attacks, our hero would crit exactly once around 33% of the time, and at least once around 41% of the time (summing the probabilities of 1 through 5, or equivalently computing $1 - P[X=0] = 1 - 0.59049 \approx 0.41$). What would happen if we got a buff and our critical hit chance went up to 35%?

$k$ critical hits    Probability
0                    0.11603
1                    0.31239
2                    0.33642
3                    0.18115
4                    0.04877
5                    0.00525

Now we see that our hero would crit exactly once in five attacks with around 31% probability, and at least once with around 88%. The histogram of the Binomial distribution takes a different shape based on the probability. For low probabilities, small numbers of successes are more likely and large numbers of successes are less likely. For high probabilities, the opposite is true. For "medium" probabilities (close to 50%), the probability concentrates in the middle part of the distribution. Use the following GameMaker app to "play" with the Binomial distribution, by varying $n$ and $p$:

[Interactive GameMaker app: Binomial distribution explorer]

How could we simulate the Binomial distribution? Very easily, indeed, since we already have a function for generating numbers with arbitrary discrete non-uniform probabilities (see previous level). The algorithm would be the following:

  1. Create an array of size $n+1$
  2. Fill the array at position $k$ with $P[X=k]$ (using the Binomial formula)
  3. Simulate random numbers using $[0,1,...,n]$ as the values array and the array from the previous step as the probability array

In fact, that's exactly how I created the GameMaker app above! Take a look at this extract from my GML code:


// comb() and fact() are functions to compute combinations and factorial 
for (var _k=0; _k<=n; _k++) {
	array_push(probs, comb(n, _k) * power(p,_k) * power(1-p, n-_k));
}

//... more code omitted ...

for (var _i=0; _i<20000; _i++) {
	array_push(data, array_random([], probs));                           
}
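
The extract references comb() and fact() without showing them; one straightforward way they could be implemented (naive, but perfectly fine for the small values of $n$ used here) is:

function fact(_n) {
	// factorial: _n * (_n-1) * ... * 2 * 1
	var _result = 1;
	for (var _i = 2; _i <= _n; _i++) _result *= _i;
	return _result;
}

function comb(_n, _k) {
	// number of ways to choose _k elements out of _n, order not important
	return fact(_n) / (fact(_k) * fact(_n - _k));
}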

The Binomial distribution is very useful for calculating probabilities over independent experiments. Another way of saying this is sampling with replacement: out of the $n$ experiments, we can always get any of the outcomes. For example, if we are playing a card battler game and we have a card that asks us to roll a d20 three times and check how many times we roll 15 or more, the Binomial distribution is appropriate, because every time, all 20 numbers can come up (in particular, all of 15, 16, ..., 20), and we can measure our probability using the Binomial formula with $n=3$ and $p=6/20$.
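
For instance, with those numbers, the probability of succeeding at least once across the three rolls comes out through the complement:

$$P[X \geq 1] = 1 - P[X=0] = 1 - \binom{3}{0} (0.3)^{0} (0.7)^{3} = 1 - 0.343 = 0.657$$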

If, in contrast, we needed to measure probabilities of success in non-independent experiments (sampling without replacement), we would need another distribution, called the Hypergeometric distribution. For example, drawing a hand in a 5-card draw poker game can be modeled with that distribution, since once you draw a card, you can't draw it again (hence, without replacement). You can find more info online (e.g. here), and you can generate it in exactly the same way as the Binomial.

The Normal and Lognormal distributions

Let's look at other distributions. This one will ring a bell... more literally than you think. The Normal distribution is a continuous distribution that arises almost everywhere in the literature. It models the complete real number line. In and of itself, it is extremely useful because it's symmetric, most of the probability lies within two standard deviations of the mean, and it's easy to understand. However, it is much more useful than other similar distributions because it appears in several important statistical results. You probably recognize, or have heard of, the "bell curve" - this is precisely the Normal distribution.

Let us check out the formula first:

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^{2}}} \exp\left(-\frac{1}{2} \left[\frac{x-\mu}{\sigma}\right]^{2}\right)$$

Daunting? Not necessarily. Let's eat the elephant one bite at a time: the exponent measures how far $x$ is from the mean $\mu$, in units of the standard deviation $\sigma$ (squared, so deviations in either direction count the same); the negative sign makes the density decay the further we move from the mean; and the $1/\sqrt{2\pi\sigma^{2}}$ factor simply normalizes the curve so the total probability is 1.

Take a look at the following GameMaker app. It implements the Normal distribution formula and lets you play with the parameters. It actually plots three Normals: one with the standard deviation you choose and two others (with 2x the standard deviation and with half the standard deviation), so you can understand how it changes. Hover over the data points to view the values. Notice that the mean parameter shifts the x-axis location of the "peak" of the curve (called the mode, which for a Normal coincides with the mean), and the $\sigma^2$ parameter modifies the amplitude of the distribution (more variance equals more dispersion):

[Interactive GameMaker app: Normal distribution plotter]

As you can see, the majority of the distribution is contained "close" to the mean. In fact, around 95% of the probability is always contained between $\mu-2\sigma$ and $\mu+2\sigma$. If you change 2 to 3, the result is 99.7%. This is a very useful fact for all Normal distributions, independently of the values of $\mu$ and $\sigma$.

It turns out that the Normal distribution successfully models a large number of natural phenomena: body weight, body height, test scores, blood pressure... These kinds of quantities have a mean, are symmetric, and most of the data falls within 2-3 standard deviations of the mean; the data resembles the "bell curve". So, all these phenomena can be modelled with the Normal distribution. Moreover, there is a very important result called the Central Limit Theorem, which implies that if you sum (linearly or weighted - so it also includes averaging) several independent random variables (say $n$ of them), the resulting variable is approximately Normal, even if the individual variables are not. This is very powerful, as it allows us to model the result without modelling the individual variables (many times it is very hard to model a specific phenomenon because it does not conform to any well-known distribution), and the approximation is already good for moderate values of $n$, such as 30.
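
You can actually see the Central Limit Theorem at work with a few lines of GML: summing 12 uniform (0,1) variables gives a variable with mean $6$ and variance $12 \times 1/12 = 1$, so subtracting 6 yields an approximately standard Normal variable. This is a classic approximation trick (the function name is mine); the Box-Muller method below generates exact Normals:

function clt_normal_approx() {
	// sum of 12 uniforms: mean 6, variance 12 * (1/12) = 1
	var _sum = 0;
	repeat (12) _sum += random(1);
	return _sum - 6; // approximately standard Normal by the CLT
}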

Regarding simulation, there are various ways to generate normal variables. The easiest in my opinion is the Box-Muller transform, which is specific for Normal distributions and consists of the following simple steps:

  1. Generate two independent continuous uniform random variables $u$ and $v$ in $(0,1)$;
  2. Calculate two standard normal variables $z_{1}$ and $z_{2}$ as follows: $z_{1} = \sqrt{-2\ln u} \cos{\left(2 \pi v\right)}$ and $z_{2} = \sqrt{-2\ln u} \sin{\left(2 \pi v\right)}$
  3. If you need a normal with arbitrary $\mu$ and $\sigma$, perform: $Z_{1} = \sigma z_{1} + \mu$, $Z_{2} = \sigma z_{2} + \mu$

We now have everything we need to generate normal variables! Well... almost. We need to decide which parameters to use. We can use some sample data or our best guess to determine their values. This is called parameter estimation. In the case of Normal distributions, the parameters are exactly the population mean and standard deviation, so we just need to estimate those. But we already saw that we can use the sample mean and (corrected) standard deviation to do this (if we have data), so we're done.

There's just a small detail. The Normal distribution concentrates the majority of the outcomes in the region around the mean we've discussed, but that does not imply we cannot get (with low probability) a value outside that region. Say we're modelling average performance scores from 0 to 10 - we know that 99.7% of the outcomes will be within our range, but what if we end up simulating a score of -1.32? Well, for the purposes of this write-up, we can just throw that variable away and re-generate the value until it's within range.[2]
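
A minimal sketch of that re-generation idea, using the box_muller function we'll implement below (normal_in_range is just an illustrative name):

function normal_in_range(_mu, _sigma, _lo, _hi) {
	var _x;
	do {
		_x = box_muller(_mu, _sigma); // implemented below
	}
	until (_x >= _lo && _x <= _hi); // re-generate until within range
	return _x;
}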

Imagine we're building our tycoon/management sim game, and we need to generate a realistic human population, with ages, weights, heights, etc. Before this section, we would probably resort to lengthy weighted discrete distributions, but even then, using those distributions would probably not result in the correct shape for our distribution, and given that we're working on a realistic simulation game, this really matters. But soon enough we'll see just how easy it is to do with Normals! Let's work on height for our example. We can look up real data for world population height (see here for example). We can see that the world male population has an average height of 178 centimeters with a standard deviation of 7.6 centimeters, while the world female population has an average height of 164 centimeters with a standard deviation of 7.1 centimeters[3]. So, if we need to generate our people randomly, all we need to do is implement the Box-Muller transform and plug those numbers in (we'd generate males separately from females, so we get the different height distributions) and we'd be ready.

Let's do it! First, here's the code for the Box-Muller implementation:


#macro PI 3.1415926535

function box_muller(_mu=0, _sigma=1) {
	var _u1 = 1 - random(1); // random(1) can return exactly 0, so shift into (0,1] to avoid ln(0)
	var _u2 = random(1);
	var _z = choose(0,1) ? sqrt(-2 * ln(_u1)) * cos(2*PI*_u2) : sqrt(-2 * ln(_u1)) * sin(2*PI*_u2);
	return _sigma * _z + _mu;
}

The above implementation is not optimal, since it "throws away" one of the two normal random variables, but it's good enough for our purposes. After that, the rest is trivial. Here's my GameMaker Realistic Height People Generator™. You can interact with the values of $\mu$ and $\sigma$, of course, and you can check the corresponding distributions of the people once generated, as well as inspect actual simulated individuals (graphic assets generated with the fabulous LPC Spritesheet Character Generator you can find here):

[Interactive GameMaker app: Realistic Height People Generator]

We can see that the Normal distribution is super easy to generate and has loads of uses!

There is also a distribution called the Lognormal distribution, which is basically a variable whose logarithm has a Normal distribution: if $Y$ has a Normal distribution, then $X = \exp({Y})$ has a Lognormal distribution. This distribution is not symmetric, but it takes only positive values, and can be used to model things like income distribution, lengths and other real-life measurements (sometimes it's better to model weights/heights with the Lognormal than with the Normal).

It is trivial to simulate a Lognormal distribution: generate a Normal variable, then transform it with the exponential formula above.
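
In GML, reusing the box_muller function from before, that's a one-liner (lognormal is my name for it):

function lognormal(_mu, _sigma) {
	// _mu and _sigma are the parameters of the underlying Normal,
	// not the mean/standard deviation of the Lognormal itself
	return exp(box_muller(_mu, _sigma));
}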

The Exponential and Poisson distributions

I'll explore two more distributions that are useful and related: the Exponential distribution and the Poisson distribution. The first one is a continuous distribution and the second one a discrete distribution. How are they related then?

The Poisson distribution measures the probability of a number of independent occurrences of a specific event in a fixed time period or space interval. The best-known phenomena we can model with this distribution are customers arriving at a store in an hour, calls entering a call center in one minute, or claims made to a company in a month. The assumptions are that the average number of events per interval, denoted $\lambda$, is constant over the studied period; that no two occurrences can happen at exactly the same time; and that the number of occurrences is theoretically unbounded (i.e. many occurrences could happen), but in practice occurrences are "rare".

The formula for the distribution is the following:

$$P[X=k] = \frac{\lambda^{k} \exp(-\lambda)}{k!}$$

The parameter $\lambda$ is both the mean AND the variance of this distribution, which both makes it easier to estimate and gives us an additional restriction. We can calculate $\lambda = rt$, where $r$ is the arrival rate and $t$ is the length of the interval. For example, if customers arrive at a rate of $r = 5$ per hour and we study a $t = 2$ hour window, then $\lambda = 10$.

The Exponential distribution, on the other hand, is a continuous positive distribution, which is mainly used to model the inter-arrival times between two Poisson occurrences. It has a parameter $\beta$, called the scale parameter, which is the mean time between occurrences. The formula is the following:

$$f(x) = \frac{1}{\beta} \exp{(-x/\beta)} $$

This distribution has a mean of $\beta$ and a variance of $\beta^2$.

How can we simulate these? For the Poisson, there are various algorithms. One of them works by emulating the cumulative distribution "bin" method we saw in the discrete non-uniform case: we would generate a uniform number and keep adding the probabilities of each value (using the Poisson formula above) until we fall into the right bin. However, this might be too costly for large values of $\lambda$, since we would have to examine many "bins". Also, it does not tell us anything about when each customer arrives, which is probably needed if we want to implement this in a game. Hence, the better idea is to take advantage of the relationship between the Exponential and the Poisson distributions. If $X$ is a Poisson variable with parameter $\lambda$, then an Exponential variable with scale parameter $\beta = 1/\lambda$ models the inter-arrival times. So, we can perform the following algorithm:

  1. Initialize our simulated Poisson variable to 0.
  2. Generate an Exponential variable with scale parameter $\beta = 1/\lambda$. This will be the inter-arrival time to the next occurrence of the event.
  3. Calculate the cumulative time so far.
  4. Increment our simulated Poisson variable and return to step 2 until the cumulative time exceeds 1 (i.e. 1 unit of time, whatever that might be, since that is the interval in which we're simulating our Poisson).
  5. Subtract one from our simulated Poisson variable and return it (because the last occurrence happens outside the interval, as stated in the point above).

So it's pretty easy, and it only depends on us being able to generate an Exponential variable. Fortunately, this is very easy using the inverse cumulative transform method, which can actually be used for all distributions [4]. The algorithm is below:

  1. Generate a continuous uniform variable $u$ in (0,1).
  2. Return $-\beta \ln(u)$.
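
To see where step 2 comes from: the cumulative distribution function of the Exponential (in the scale parameterization we're using) is $F(x) = 1 - \exp(-x/\beta)$, and the inverse transform method sets $u = F(x)$ and solves for $x$:

$$u = 1 - \exp(-x/\beta) \implies x = -\beta \ln(1-u)$$

Since $1-u$ is itself uniform in $(0,1)$, we can use $u$ directly and return $-\beta \ln(u)$.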

That's all we need! Note that, if we need the arrival times (hint: we probably do), we can return them as well from our Poisson-generating algorithm. In this way, we are actually not just generating Poisson random variables, but a Poisson process. We can easily implement this in GameMaker:


function Exponential(_beta) {
	var _u = random(1);
	return -_beta * ln(_u);
}

function Poisson(_lambda, _return_array = false) {
	var _n = 0;
	var _arrival_times = [];
	var _cumulative_time = 0;
	do {
		_n++;
		var _exp = Exponential(1/_lambda);
		_cumulative_time += _exp;
		array_push(_arrival_times, _cumulative_time);
	}
	until (_cumulative_time > 1);
	array_resize(_arrival_times, _n-1);
	
	if (_return_array) return _arrival_times;
	else return _n-1;
}

The code above can either return the actual number (as in, the number of occurrences of the event that happened in one unit of time), or the array holding the cumulative arrival times (in which case the array length is the actual Poisson variate, and the differences between consecutive entries are the Exponential inter-arrival times).
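
Usage is then straightforward; for example, to simulate one hour of arrivals averaging 5 customers (two independent draws shown):

var _n_customers = Poisson(5);        // just the count for this hour
var _arrivals = Poisson(5, true);     // cumulative arrival times within (0,1)
// e.g. a value of 0.25 in _arrivals means "a quarter of the way into the hour"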

Let's jump into a practical example where we're working on our coffee shop management game. For simplicity's sake, we can assume the rate of customers arriving every hour is constant. We might model the customer arrivals with a Poisson process with parameter 5 per hour (our shop is just starting up - for comparison, and not accounting for peak hours etc., the average number of customers arriving at a Starbucks is more like 60 per hour!).

[Interactive GameMaker app: coffee shop Poisson simulation]

You can see how easy it is to simulate the Poisson process and how powerful it is! We can simulate plane arrivals, customers, calls... You just need to check that the phenomenon adheres to the Poisson assumptions, and divide the problem when it doesn't. For example, we could have simulated each hour of the coffee shop with a different parameter to model peak hours.

We have seen a very small number of distributions. We've covered the essentials so you can get yourself started; however, there are tons and tons of useful distributions for simulating different phenomena. Explore and go crazy!

The last level of the dungeon before reaching the end will be dedicated to talking a little bit about game design and randomness.


[1] For continuous distributions, we cannot sum the weighted values (or differences from the mean) - we need to integrate them. However, the intuition for the formulas is the same.

[2] If you want to do it kosher, the truncated normal distribution also exists, and there are easy enough accept-reject simulation methods to generate variables from those distributions.

[3] I refuse to work with Imperial system measurements. If you studied in one of the three countries in the world that still stick to the Imperial system, sorry about that.

[4] Theoretically, it can be used for all invertible distributions; practically, it can be used for all distributions for which you have a closed form or numerical algorithm for the inverse cumulative probability distribution function.