Confidence intervals and statistical significance
What are confidence intervals? Why is statistical significance so important? How are these two concepts related?
This guide is one in a series on different aspects of statistical literacy. The others can be found in the House of Commons Library's Good Information Toolkit.
In statistics it is important to measure how confident we can be in the results of a survey or experiment. Confidence intervals and measures of statistical significance are ways of demonstrating the reliability of a statistical finding.
Statistical estimates are sometimes presented with a 95% confidence interval (CI) around a central figure. For example, the estimated employment rate for August to October 2025 was 75.1% (95% CI ±0.5 percentage points), meaning we can be 95% confident that the employment rate for the whole population was between 74.6% and 75.6%.
A result is statistically significant if it is unlikely to have occurred by chance. This is reported in statistical testing using a p-value, which normally gives the probability of getting the observed (or more extreme) results if there is no real effect. A small p-value means there is only a small possibility that the result could have happened by chance. So, for instance, a p-value of 0.01 means we’d only expect this result 1% of the time if no real effect exists. Here we would infer that the relationship being tested is statistically significant.
What are confidence intervals?
One way of measuring the degree of confidence in statistical results is to look at the confidence interval reported by researchers.
Confidence intervals are usually presented as a range around a central average or estimate; they are the standard way of expressing the statistical accuracy of an estimate based on a survey or sampling (that is, measuring something for a sample of the entire population rather than for the whole population directly). A wider confidence interval means the estimate is less precise.
If an estimate is based on sampling, the 95% confidence interval describes the range within which the true population value should fall 95% of the time when taking different samples from the same population.
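The 'different samples from the same population' idea can be checked with a small simulation. The sketch below is illustrative only: the population mean, standard deviation, sample size, number of trials and random seed are all made-up assumptions, and the function name `coverage` is hypothetical. It draws repeated samples from a normal population, builds a 95% interval from each sample, and counts how often the interval contains the true mean.

```python
import random
from statistics import mean, stdev


def coverage(true_mean=50.0, true_sd=10.0, n=100, trials=1000, seed=1):
    """Share of 95% confidence intervals that contain the true mean.

    All default values are illustrative assumptions, not real data.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Draw one sample of size n from the (hypothetical) population.
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        sample_mean = mean(sample)
        standard_error = stdev(sample) / n ** 0.5
        lower = sample_mean - 1.96 * standard_error
        upper = sample_mean + 1.96 * standard_error
        if lower <= true_mean <= upper:
            hits += 1
    return hits / trials


# The share should be close to 0.95; the exact figure depends on the seed.
print(f"{coverage():.3f}")
```

Any single interval either contains the true value or does not; the 95% figure describes how often the procedure succeeds over many samples.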
The 95% confidence interval is the most common, but researchers can also report other confidence intervals (such as 99%, 90% or 80%). A larger percentage confidence interval, say 99% rather than 95%, will give a wider range for the same data. This gives the higher level of confidence that the range includes the true population value.
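As a rough illustration of why a higher confidence level gives a wider range, the sketch below computes the multiplier applied to the standard error at each level. It assumes the estimate follows a normal distribution; the function name `z_multiplier` is hypothetical, and `NormalDist` comes from Python's standard library.

```python
from statistics import NormalDist


def z_multiplier(level):
    """Two-sided normal multiplier for a confidence level such as 0.95."""
    # A 95% interval leaves 2.5% in each tail, so we look up the 97.5th percentile.
    return NormalDist().inv_cdf(0.5 + level / 2)


for level in (0.80, 0.90, 0.95, 0.99):
    print(f"{level:.0%} interval: estimate ± {z_multiplier(level):.3f} standard errors")
```

The multiplier for 95% is about 1.96 and for 99% about 2.58, so the 99% interval is roughly 30% wider for the same data.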
Where survey or test results are based on a larger sample size or results are less variable, the confidence interval will be smaller, other things equal.
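The effect of sample size and variability can be sketched using the standard-error formula set out later in this guide (standard deviation divided by the square root of the sample size). The standard deviations and sample sizes below are hypothetical, as is the function name `half_width`.

```python
def half_width(sd, n, z=1.96):
    """Half-width of a 95% confidence interval for a mean (normal approximation)."""
    return z * sd / n ** 0.5


# Same variability, larger samples: the interval narrows.
for n in (100, 400, 1600):
    print(f"n={n}: ± {half_width(10.0, n):.2f}")

# Same sample size, less variable results: the interval also narrows.
print(f"sd=5, n=100: ± {half_width(5.0, 100):.2f}")
```

Because the sample size enters through its square root, quadrupling the sample only halves the width of the interval.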
Calculating confidence intervals
Confidence intervals are based on the characteristics of a normal distribution. The chart below shows what a normal distribution of results is expected to look like. If a statistic follows a normal distribution then we would expect 95% of observations to lie within around two standard deviations of the mean.
We calculate a 95% confidence interval using a method derived from this ‘rule’. It is the mean ±(1.96 multiplied by the standard error of the mean). The standard error of the mean is the standard deviation divided by the square root of the sample size and is a measure of the precision of a sample mean, rather than the spread of data points.
Worked example
To calculate a 95% confidence interval we need to calculate the mean average and standard error from a set of data. The standard error is calculated from the standard deviation of the data which itself is the square root of the variance.
Variance is calculated by measuring the difference between each measurement and the mean value of all measurements, squaring them, adding them together, and dividing them by the number of measurements. That is, it is the average squared deviation from the mean.
The table below gives an example of this calculation for a sample of 10 people asked their salaries.
The mean salary is the sum of all salaries, shown in the first column, divided by the number of respondents.
This is then subtracted from each individual value to get their difference from the mean, shown in the second column. The sum of these values is always zero, as the total of all the negative values equals the total of the positive values. The same is true of their mean value. This is why the deviations are squared to get a non-zero total.
The squared deviations and the average of them (the variance) are shown in the third column. The standard deviation is the square root of the variance, which here is £18,977.
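The steps described above can be sketched in a few lines. The five salaries below are hypothetical and are not the survey data from the table. The variance divides by the number of measurements, as described in this guide; dividing by one fewer than the number of measurements is a common alternative for samples.

```python
# Hypothetical salaries, for illustration only (not the survey data).
salaries = [20_000, 30_000, 40_000, 50_000, 60_000]

mean_salary = sum(salaries) / len(salaries)

# Differences from the mean: these always sum to zero.
deviations = [s - mean_salary for s in salaries]

# Squaring removes the signs, so the total is no longer zero.
squared_deviations = [d ** 2 for d in deviations]

# Variance: the average squared deviation from the mean.
variance = sum(squared_deviations) / len(salaries)

# Standard deviation: the square root of the variance.
standard_deviation = variance ** 0.5

print(f"mean = {mean_salary:,.0f}")
print(f"sum of deviations = {sum(deviations):,.0f}")
print(f"variance = {variance:,.0f}")
print(f"standard deviation = {standard_deviation:,.0f}")
```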
The standard error of the mean is £18,977 divided by the square root of the sample size. This is £6,001.
The 95% confidence interval for the mean salary is therefore £45,188 ± (1.96 multiplied by £6,001), which is approximately £33,430 to £56,950.
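The arithmetic in this worked example can be checked directly from the summary figures quoted in the text (mean £45,188, standard deviation £18,977, sample of 10). The variable names are illustrative, and the individual salaries are not reproduced here.

```python
# Summary figures quoted in the worked example.
mean_salary = 45_188
standard_deviation = 18_977
sample_size = 10

# Standard error of the mean: standard deviation / square root of sample size.
standard_error = standard_deviation / sample_size ** 0.5

# 95% confidence interval: mean ± 1.96 standard errors.
margin = 1.96 * standard_error
lower = mean_salary - margin
upper = mean_salary + margin

print(f"standard error = £{standard_error:,.0f}")
print(f"95% CI = £{lower:,.0f} to £{upper:,.0f}")
```

Rounded to the nearest £10, the bounds come out at about £33,430 and £56,950.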
What is statistical significance?
A result is statistically significant if it is unlikely to have occurred due to random chance, suggesting a real effect.
A finding may be statistically significant without being ‘substantively significant’, that is, important or meaningful. For instance, in very large samples, it is common to discover many statistically significant findings where the sizes of the effects are so small that they are meaningless or trivial.
Assessing statistical significance
Significance testing is normally used to establish whether a statistical result is likely to have occurred by chance. This may be to test whether the difference between two averages (for example, the mortality rates for two medical procedures) is ‘real’ or whether there is a relationship between two or more variables.
The key to most significance testing is to establish whether the results support or reject the null hypothesis. The null hypothesis refers to any hypothesis that the research aims to prove wrong, and it normally presumes there is no difference in averages or no correlation between variables. For example, in a study of the effects of consuming alcohol on the ability to drive a car, the null hypothesis would be that consuming alcohol has no effect on an individual’s ability to drive.
The most common significance level used to judge whether a finding is statistically significant is 5% (written as p<0.05). This means that the probability (p-value) of obtaining the observed result, or a more extreme one, is less than 0.05 (or 5%) if the null hypothesis is true. Where findings meet this criterion, it is normally inferred that the null hypothesis is false.
While the 5% level is standard across most social sciences, the 1% level (p<0.01) is also fairly common. At this level it is harder to find a statistically significant result.
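As a minimal sketch of how a p-value is compared with a significance level, the example below assumes a test statistic that follows a standard normal distribution when the null hypothesis is true (a z-test). The test statistic of 2.2 is hypothetical, and the function names are illustrative.

```python
from statistics import NormalDist


def two_sided_p_value(z):
    """Probability of a result at least this extreme if the null hypothesis is true."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


def is_significant(p_value, level=0.05):
    """True if the p-value falls below the chosen significance level."""
    return p_value < level


z = 2.2  # hypothetical test statistic
p = two_sided_p_value(z)

print(f"p-value = {p:.4f}")
print("significant at 5%:", is_significant(p, 0.05))
print("significant at 1%:", is_significant(p, 0.01))
```

Here the same result passes the 5% threshold but not the stricter 1% threshold, which is what makes significance harder to find at the 1% level.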
Types of error in significance testing
A type 1 error, also known as a false positive, is when the null hypothesis is rejected when it is actually true. In most cases this means that there is no real effect, but the test indicates that there is. Much null hypothesis testing is aimed at reducing the possibility of a type 1 error by testing against a smaller p-value and only rejecting the null hypothesis if there is a smaller chance of a result occurring due to chance. The aim of this is to reduce the possibility of falsely claiming some connection.
A type 2 error, also known as a false negative, occurs when the null hypothesis is not rejected even though it is false. In most cases this means there is a real effect and the results are not down to chance alone, but the test did not detect it.
The statistical ‘power’ of a test is one minus the type 2 error rate and is the probability of correctly rejecting a false null hypothesis (a ‘true positive’ finding). A higher power raises the chances that a test will be conclusive. It is not common for the type 2 error rate or statistical power to be calculated in significance testing.
Convention in many areas of social science is that type 2 errors are preferable to type 1 errors. There is a trade-off between them as type 1 errors can be reduced by setting a very low significance level (p<0.01 or p<0.001) but this increases the likelihood of type 2 error (that a false null hypothesis will be accepted).
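The trade-off can be illustrated with a rough calculation for a two-sided z-test. This is a sketch under stated assumptions: the test statistic is taken to be normally distributed, and the true effect is assumed to be 2.5 standard errors away from the null value (a made-up figure). The function name `power_two_sided_z` is hypothetical.

```python
from statistics import NormalDist


def power_two_sided_z(effect_in_se, alpha):
    """Probability of rejecting the null hypothesis in a two-sided z-test.

    effect_in_se is the true effect divided by its standard error.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Chance the test statistic lands beyond either critical value.
    upper_tail = 1 - std_normal.cdf(z_crit - effect_in_se)
    lower_tail = std_normal.cdf(-z_crit - effect_in_se)
    return upper_tail + lower_tail


for alpha in (0.05, 0.01, 0.001):
    power = power_two_sided_z(2.5, alpha)
    print(f"significance level {alpha}: power = {power:.2f}, "
          f"type 2 error rate = {1 - power:.2f}")
```

With no real effect (`effect_in_se` of zero) the function returns the significance level itself, which is the type 1 error rate.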
Example of basic significance testing
To revisit our example, a survey of 10 people asked respondents their current salary but also their age, in order to investigate whether age is associated with salary. The results are shown below.
Here our null hypothesis is that age is not associated with salary (and the significance level is 5%). Using a simple linear regression we find that income increased on average by £1,150 for each additional year of age. The p-value for this was 0.0025. Therefore the result is statistically significant at the 5% level (as p<0.05) and we reject the null hypothesis (that there is no connection between age and salary).
Researchers don’t always report the size of this association (called the β‑value), but it helps the author start to establish importance or substantive significance if they report it. Just as important is the confidence interval of the estimate. The 95% confidence interval of β in this example is £535 to £1,760, so we expect that the true value would fall in this range 95% of the time. This tells the reader both about the size of the effect and illustrates the level of uncertainty of the estimate.
There is a precise link between significance levels and confidence intervals. If the 95% confidence interval includes the value of β assumed for the null hypothesis (here zero) then p≥0.05 and the null hypothesis is not rejected at the 5% level. Similarly, if the 99% confidence interval included zero then the hypothesis would not be rejected at the 1% level.
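This link can be checked against the figures quoted in the example (β of £1,150, 95% confidence interval £535 to £1,760). The sketch below makes one assumption that the text does not state: that the interval was built from a t-distribution with 8 degrees of freedom (10 respondents minus 2 estimated parameters), for which the 95% critical value is 2.306. The function and variable names are illustrative.

```python
def excludes_null(ci_low, ci_high, null_value=0.0):
    """True if the null value lies outside the confidence interval."""
    return not (ci_low <= null_value <= ci_high)


# Figures quoted in the example.
beta = 1_150
ci_low, ci_high = 535, 1_760

# Assumption: 95% critical value of a t-distribution with 8 degrees of freedom.
T_CRIT_95_DF8 = 2.306

# Recover the standard error from the interval's half-width, then the t-statistic.
standard_error = (ci_high - ci_low) / (2 * T_CRIT_95_DF8)
t_statistic = beta / standard_error

print("95% CI excludes zero:", excludes_null(ci_low, ci_high))
print(f"t-statistic = {t_statistic:.2f} (critical value {T_CRIT_95_DF8})")
```

The interval excludes zero and the t-statistic exceeds the critical value, so both routes lead to rejecting the null hypothesis at the 5% level.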
Commentary on significance testing
The type of null hypothesis testing described here is the one that most readers are likely to find in the social sciences and medicine.
Some authors have criticised the way researchers sometimes conduct this significance testing, for example by misinterpreting results, not reporting the size of coefficients in a linear regression, focusing on random error instead of other sources of error, not stating alternative hypotheses and ignoring alternative types of significance testing.
The most important criticism is that some authors focus on statistical significance instead of substantive significance (or importance). For example, regression might find a statistically significant association between age and salary, but if salary only goes up by £1 per year of age, is this substantively significant or important?
Statistical significance is not sufficient for substantive significance in the field in question. It may also not be necessary in certain circumstances.