Measures of average and spread
What do we mean by the term average? What are the different types of average and measures of spread? How are these calculated?
This guide is one in a series on different aspects of statistical literacy. The others can be found in the House of Commons Library's Good Information Toolkit.
Taking an average is a common way of summarising a set of values: it gives a single value calculated to represent what is typical across a dataset. This value usually lies somewhere in the middle of the dataset, so averages are also called measures of central tendency.
This makes it easier to talk about trends in datasets. For example, it is easier to consider how the average household income has increased since 2000 than to look at two large sets of data showing individual households’ incomes in each year.
Averages are often accompanied by measures of spread. While the average gives an idea of the middle of the dataset, measures of spread give an idea of the extremes, such as the largest and smallest values in the set.
Types of average
There are three often-used measures of average:
- Mean – this is typically what we, in everyday language, would think of as the average of a set of figures, and is a ‘central’ representation of the values
- Median – the middle value of a dataset
- Mode – the most common value in a dataset
1: To calculate the mean, add all the numbers together and divide this figure by how many values there are
2: To calculate the median, arrange the numbers in order (rank) and find the middle number
3: To calculate the mode, count how often each value appears, and the number that appears the most often is the mode
Mean
The mean is the most commonly used form of average; the term usually refers to the ‘arithmetic mean’.
The mean is calculated by adding up all the figures in a dataset, and then dividing this sum by the number of values in the dataset. The following example explains this:
What should you consider when calculating the mean?
The mean can be affected by unusually high or low values in the dataset, known as outliers. As a result, the mean average may not always be typical of the values in the dataset.
For example, in the above data, if the individual earning £15.65 per hour had instead earned £21, the mean earnings would have been £14.83 per hour. This would not have been typical of the group, because it is much higher than the wage most of the group receive.
The mean is often useful as a base for further calculation, such as estimating the cost or effect of a change. For example, if everyone’s pay increased by 10%, the mean would also increase by 10%, meaning an employer could calculate the cost of increasing everyone’s pay just by using the mean. This isn’t the case for other averages, like median and mode.
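The scaling property described above can be checked with a short Python sketch. The pay figures here are hypothetical, chosen only for illustration, and are not the guide's own example data.

```python
from statistics import mean

# Hypothetical hourly pay rates (illustrative only, not the guide's data)
pay = [12.00, 13.00, 14.00, 15.00, 21.00]

# Mean: add all the values together and divide by how many there are
mean_pay = sum(pay) / len(pay)

# Give everyone a 10% rise and recalculate the mean from scratch
raised_pay = [p * 1.10 for p in pay]
mean_raised = mean(raised_pay)

# The recalculated mean matches the old mean increased by 10%
print(round(mean_pay, 2), round(mean_raised, 2), round(mean_pay * 1.10, 2))
```

An employer could therefore estimate the total cost of the rise from the mean and the number of employees alone, without revisiting each individual pay rate.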
Median
If we instead want to give a more typical, or middle, value to represent the dataset, we may want to use the median rather than the mean. The median is less affected by values at the extremes than the mean, so it can sometimes be a better reflection of typical values.
To calculate the median we arrange the figures in rank order (from smallest to largest) and take the middle value. If there is no middle value because there is an even number of figures, then the median is usually taken to be mid-way between the two middle points.
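The ranking procedure above, including the even-count case, might be sketched in Python as follows. The function name `median_by_rank` and the sample values are hypothetical and used only for illustration.

```python
def median_by_rank(values):
    """Return the middle value, or the midpoint of the two middle values."""
    ordered = sorted(values)      # arrange in rank order, smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                # odd count: there is a single middle value
        return ordered[mid]
    # even count: take the point mid-way between the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_example = median_by_rank([3, 1, 2])        # middle of 1, 2, 3
even_example = median_by_rank([4, 1, 3, 2])    # mid-way between 2 and 3

# Replacing the top value with an outlier leaves the median unchanged
without_outlier = median_by_rank([12, 13, 14, 15, 16])
with_outlier = median_by_rank([12, 13, 14, 15, 21])
print(odd_example, even_example, without_outlier, with_outlier)
```

The last two calls illustrate why the median is less sensitive to extreme values than the mean: only the rank order matters, not how far the largest value sits from the rest.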
If we take the same earnings example, but with an outlier value included, we can see the usefulness of calculating the median. For example, the median for the following data is £13.85, as this is the middle value:
If we calculated the mean for the above values, it would be £14.83, which would not be typical for most employees.
Mode
The mode is the value which occurs most frequently in the dataset.
If we take the example of employees’ pay again, but with more values, we can see how to find the mode:
Hourly rate of pay
Employee  Pay/hour
1         £12.20
2         £12.60
3         £13.10
4         £13.10
5         £13.95
6         £13.95
7         £13.95
8         £14.25
9         £15.10
£13.95 is the value which appears most within our dataset, so is our mode value.
In real life, we use the mode more often than we realise. For example, the ‘top 10’, ‘most popular’, or ‘second favourite’ are simply measures which are looking at the most common (or second most common) values – and so are modal measures.
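As a rough illustration, counting how often each value appears gives both the mode and a ‘most popular’ ranking. This sketch uses the nine hourly pay rates from the example above and Python's standard `collections.Counter`; the variable names are chosen for illustration.

```python
from collections import Counter

# The nine hourly pay rates from the example above
pay = [12.20, 12.60, 13.10, 13.10, 13.95, 13.95, 13.95, 14.25, 15.10]

# Count how often each value appears, then rank from most to least common
counts = Counter(pay)
ranked = counts.most_common()

mode_value, mode_count = ranked[0]        # the most common value
second_value, second_count = ranked[1]    # the second most common value

print(f"Mode: £{mode_value:.2f} (appears {mode_count} times)")
print(f"Second most common: £{second_value:.2f} (appears {second_count} times)")
```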
Weighted averages
An average calculated as the arithmetic mean assumes equal importance of the items for which the average is being calculated. But sometimes it is important to account for differences in size or importance of figures when making this calculation.
For example, when looking at incomes, if the average income of junior employees at a company was £280 per week and the average for senior employees was £300, it would be wrong to say that the average for all employees was £290, that is (280 + 300) / 2.
This is because, in this hypothetical company, there are around twice as many junior employees as senior employees, and this needs to be taken into account in calculating the overall average. If we give twice as much weight to the value for junior employees as to the value for senior employees, the overall average comes to £287.
The calculation of this is set out below:
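A minimal Python sketch of the weighted calculation, using the figures from the text: weekly incomes of £280 and £300, with the junior figure given twice the weight of the senior one. The 2:1 weights reflect the ‘around twice as many’ assumption stated above rather than exact headcounts.

```python
# Average weekly income for each group: junior, senior
values = [280, 300]

# Weights: roughly two junior employees for every senior employee
weights = [2, 1]

# Weighted mean: multiply each value by its weight, add up,
# then divide by the total weight
weighted_mean = sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Unweighted mean for comparison, which treats both groups as equal in size
simple_mean = sum(values) / len(values)

print(f"Weighted: £{weighted_mean:.0f}, unweighted: £{simple_mean:.0f}")
```

Written out, this is (2 × 280 + 1 × 300) / 3 = 860 / 3, which rounds to £287.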
Other averages
There are other, less commonly used measures of average, some of which are briefly described below:
- Geometric mean – the nth root of the product of n data values (for example, for two values, multiply them together and take the square root). This may be used, for example, to minimise the effects of extreme values when calculating growth rates.
- Harmonic mean – the reciprocal of the arithmetic mean of the reciprocals of the data values. This is recommended by the Environmental Protection Agency (US) as a means of setting maximum toxin levels in water.
- Quadratic mean or root mean square (RMS) – the square root of the arithmetic mean of the squares of the data values
- Generalised mean – generalising the above, the pth root of the arithmetic mean of the pth powers of the data values, for a chosen power p (p = 1 gives the arithmetic mean and p = 2 gives the quadratic mean)
- Weighted mean – an arithmetic mean that incorporates weighting to certain data elements
- Truncated mean – this discards the highest and lowest values in the dataset, and then calculates the arithmetic mean from the values left over. This is used, for example, in ice skating scoring.
- Interquartile mean – a type of truncated mean where the highest 25% and lowest 25% of the dataset are excluded (what is left lies within the interquartile range, see below)
- Winsorised mean – similar to the truncated mean, but rather than removing the extreme values, they are capped at specific values
- Midrange – the arithmetic mean of the highest and lowest values of the data or distribution
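Several of the measures listed above can be sketched in a few lines of Python. The dataset is hypothetical, and the trimming and capping choices (dropping or capping one value at each end) are assumptions made for illustration; in practice these choices vary.

```python
import math
from statistics import geometric_mean, harmonic_mean, mean

# Hypothetical positive values (illustrative only)
data = [1, 2, 4, 8, 16]
ordered = sorted(data)

arithmetic = mean(data)
geometric = geometric_mean(data)      # nth root of the product of n values
harmonic = harmonic_mean(data)        # reciprocal of the mean of reciprocals
quadratic = math.sqrt(mean([x ** 2 for x in data]))   # root mean square

# Truncated mean: discard the single highest and lowest values (assumed trim)
truncated = mean(ordered[1:-1])

# Winsorised mean: cap the extremes at the next-nearest values (assumed caps)
low_cap, high_cap = ordered[1], ordered[-2]
winsorised = mean([min(max(x, low_cap), high_cap) for x in data])

# Midrange: mean of the highest and lowest values
midrange = (ordered[0] + ordered[-1]) / 2

for name, value in [("harmonic", harmonic), ("geometric", geometric),
                    ("arithmetic", arithmetic), ("quadratic", quadratic),
                    ("truncated", truncated), ("winsorised", winsorised),
                    ("midrange", midrange)]:
    print(f"{name:>10}: {value:.3f}")
```

For positive values that are not all equal, the harmonic mean is smaller than the geometric mean, which is smaller than the arithmetic mean, which is smaller than the quadratic mean. This is one reason the geometric mean is less affected by large extreme values.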
An average gives us a central value that we can use to represent a whole dataset. Measures of variation and spread tell us how far the rest of the values in that dataset are from that average, which helps us assess how representative the average is of the full range of values.
Range
The range is a measure of spread in data and is calculated as the difference between the largest and smallest values in a dataset. If we take our employee pay data, we can see an example of this below:
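As a minimal sketch, the range can be calculated in Python as follows. It is an assumption here that the nine hourly pay rates from the mode example are a suitable stand-in for the employee pay data.

```python
# Assumed data: the nine hourly pay rates from the mode example
pay = [12.20, 12.60, 13.10, 13.10, 13.95, 13.95, 13.95, 14.25, 15.10]

# Range: the largest value minus the smallest value
pay_range = max(pay) - min(pay)

print(f"Highest £{max(pay):.2f}, lowest £{min(pay):.2f}, range £{pay_range:.2f}")
```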
Quantiles
If data are arranged in order we can give more information about the spread by finding values that lie at various intermediate points. These points are known generically as quantiles.
For example, the values that split the dataset into four groups, each containing 25% of the data, are called the quartiles. Similarly, splitting the data into ten equal-sized groups gives deciles, five groups gives quintiles, and 100 groups gives percentiles. (In practice, it is unlikely that you would want to look at all 100 percentiles, but sometimes the boundary for the top or bottom 5%, or another value, is of particular interest.)
One commonly used measure is the interquartile range. This is the difference between the boundary of the top quartile and the boundary of the bottom quartile. It is therefore a range that encompasses the middle 50% of the values in a dataset.
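The quartile boundaries and interquartile range might be calculated in Python as sketched below, again reusing the nine hourly pay rates from the mode example as assumed data. The `method="inclusive"` setting is one of several conventions for placing quantile boundaries; other conventions can give slightly different results, particularly for small datasets.

```python
from statistics import quantiles

# Assumed data: the nine hourly pay rates from the mode example
pay = [12.20, 12.60, 13.10, 13.10, 13.95, 13.95, 13.95, 14.25, 15.10]

# n=4 splits the data into four groups and returns the three boundaries
lower_quartile, middle, upper_quartile = quantiles(pay, n=4, method="inclusive")

# Interquartile range: upper quartile boundary minus lower quartile boundary
interquartile_range = upper_quartile - lower_quartile

print(f"Lower quartile £{lower_quartile:.2f}, upper quartile £{upper_quartile:.2f}")
print(f"Interquartile range £{interquartile_range:.2f}")
```

Changing `n` to 5, 10 or 100 returns the boundaries for quintiles, deciles or percentiles in the same way.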
Standard deviation and variance
For each value in a dataset it is possible to calculate the difference between it and the average (usually the mean). Averaging the sizes of these differences, ignoring their sign, gives what we call the ‘mean deviation’. This can give us an indication of the variability in a dataset.
However, it is more common to instead use the ‘standard deviation’ (or variance) to measure spread, or variability, in a dataset. It is useful in comparing sets of data which may have the same mean but a different range.
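The standard deviation is the root mean square (RMS) deviation of the values from their arithmetic mean. To calculate it, take the difference between each value and the mean, square each difference, add the squares together, divide by the number of values, and take the square root of the result. The variance is the square of the standard deviation, that is, the result before the square root is taken.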
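A step-by-step sketch of the standard deviation in Python is given below. It treats the data as the whole population, so the sum of squared differences is divided by the number of values before the square root is taken; a sample-based estimate would divide by one fewer. The dataset is hypothetical.

```python
import math
from statistics import pstdev

# Hypothetical values (illustrative only); their mean is 5
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean_value = sum(data) / len(data)

# Difference between each value and the mean, squared
squared_diffs = [(x - mean_value) ** 2 for x in data]

# Variance: the mean of the squared differences
variance = sum(squared_diffs) / len(data)

# Standard deviation: the square root of the variance
std_dev = math.sqrt(variance)

# The standard library's population standard deviation should agree
print(variance, std_dev, pstdev(data))
```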
If the data points are all close to the mean, then the standard deviation is close to zero. If many data points are far from the mean, then the standard deviation is far from zero. If all the data values are equal, then the standard deviation is zero. Where two sets of data have different means, it is possible to compare their spread by looking at the standard deviation as a percentage of the mean.
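To illustrate comparing spread across datasets with different means, the sketch below expresses the standard deviation as a percentage of the mean. The function name and both datasets are hypothetical.

```python
from statistics import mean, pstdev

def spread_as_percent_of_mean(values):
    """Standard deviation expressed as a percentage of the mean."""
    return pstdev(values) / mean(values) * 100

# Two hypothetical datasets: the second is the first multiplied by ten
small_values = [9, 10, 11]
large_values = [90, 100, 110]

small_percent = spread_as_percent_of_mean(small_values)
large_percent = spread_as_percent_of_mean(large_values)

# Identical values have no spread at all
no_spread = pstdev([5, 5, 5])

print(f"{pstdev(small_values):.2f} vs {pstdev(large_values):.2f} in absolute terms")
print(f"{small_percent:.1f}% vs {large_percent:.1f}% relative to the mean")
```

The second dataset has a standard deviation ten times larger, but relative to its mean the spread is the same.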
The standard deviation is especially important for data that is ‘normally distributed’ (where the distribution is in the shape of a bell). In a normal distribution, 68% of values are within one standard deviation of the mean, 95% are within two standard deviations, and 99.7% are within three standard deviations. This BBC article on types of variation includes an example of normal distribution.
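The percentages quoted above can be reproduced with Python's `statistics.NormalDist`, as in this short sketch. The helper name `share_within` is hypothetical.

```python
from statistics import NormalDist

# A normal distribution with mean 0 and standard deviation 1
standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Share of values lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

shares = {k: share_within(k) for k in (1, 2, 3)}

for k, share in shares.items():
    print(f"Within {k} standard deviation(s): {share:.1%}")
```

The figures in the text are rounded: the shares are closer to 68.3%, 95.4% and 99.7%.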
This principle underpins a lot of statistical work where samples of a population are used to estimate values for the population as a whole (for further details see the statistical literacy guide on Confidence intervals and statistical significance).
There are various formulas and ways of calculating the standard deviation – these can be found in most statistics textbooks or online, for example, on BBC Bitesize.
Excel functions to calculate average and spread
While it is possible to calculate these from the above principles, there are a number of statistical functions in Excel which can be useful shortcuts to calculate averages and spread. Some of these are highlighted below:
- AVERAGE – Returns the average (arithmetic mean) of its arguments
- COUNT – Counts how many numbers are in the list of arguments
- LARGE – Returns the k-th largest value in a data set, where you determine k (for example, the 5th largest number)
- MAX – Returns the largest (maximum) value in a list of values
- MEDIAN – Returns the median of the given numbers
- MIN – Returns the smallest (minimum) value in a list of arguments
- MODE.MULT – Returns the most frequently occurring values in a data set
- MODE.SNGL – Returns the single most frequently occurring value in a dataset
- PERCENTILE.EXC – Returns the k-th percentile of values in a range, where k is in the range 0..1, exclusive
- PERCENTILE.INC – Returns the k-th percentile of values in a range, where k is in the range 0..1, inclusive
- PERCENTRANK.EXC – Returns the rank of a value in a data set as a percentage (0..1, exclusive)
- PERCENTRANK.INC – Returns the rank of a value in a data set as a percentage (0..1, inclusive)
- QUARTILE.EXC – Returns the quartile of a data set, based on percentile values from 0..1, exclusive
- QUARTILE.INC – Returns the quartile of a data set, based on percentile values from 0..1, inclusive
- RANK.AVG – Returns the rank of a number in a list of numbers: its size relative to the other values in the list, and will return the average rank if more than one value has the same rank
- RANK.EQ – Returns the rank of a number in a list of numbers: its size relative to the other values in the list, and will return the top rank of that set of values if more than one value has the same rank
- SMALL – Returns the k-th smallest value in a dataset (for example, the 5th smallest number)
- STDEV.P – Calculates standard deviation based on the entire population
- STDEV.S – Estimates standard deviation based on a sample