A basic outline of regression analysis
What can statistical techniques tell us about the relationship between two variables?
This guide is one in a series on different aspects of statistical literacy. The others can be found in the House of Commons Library's Good Information Toolkit.
Regression analysis is a technique that statisticians can use to establish with reasonable certainty the strength of a relationship between two variables.
Social scientists are often interested in analysing whether a relationship exists between two variables in a population. For instance, is greater corruption control associated with higher gross domestic product per capita? Does increased per capita health expenditure lead to lower rates of infant mortality? Statistics can give us information about the strength of association between two variables. This means that statistics can sometimes help us decide if there is a causal relationship, but they are not sufficient to establish this on their own.
Regression analysis can establish with more certainty whether an association between two variables exists (or not), but on its own it cannot prove that the relationship is causal.
What are dependent and independent variables?
Relationships are often expressed in terms of a dependent or response variable (Y) and one or more independent or explanatory variables (Xi).
The dependent variable is the condition against which certain effects are measured.
Independent variables are variables that are tested to see what association, if any, they have with the dependent variable.
A scatter plot of Y against X can give a very general picture of the relationship between dependent and independent variables, but this is rarely more than a starting point.
What is a simple linear regression model?
A simple linear regression model is a form of regression analysis.
A simple linear regression model can help statisticians to establish the direction, size and significance of a potential association between a dependent variable and a single independent variable.
A ‘simple linear regression model’ is so called:
- ‘simple’ because one explanatory (independent) variable is being tested, rather than several
- ‘linear’ because it is assessing whether there is a straight-line association between X and Y
- ‘model’ because the process is a simplified yet useful abstraction of the complex processes that determine the values of Y
The typical notation for a simple regression model is in the form of an equation of a straight line:
Yi = α + βXi + ε
Where:
- Yi is the dependent variable
- Xi is the independent variable
- α (alpha) is the intercept of the regression line (that is, the value of Y when Xi equals zero)
- β (beta) is the regression line coefficient (that is, the slope of the line that shows the increase (or decrease) in Y given a one-unit increase in Xi)
- ε (epsilon) is the error term (that is, the distance between the actual value of Y and the predicted value)
These parameters are estimated so as to minimise the sum of the squared errors (Σε²), which produces a line of best fit. The elements of this equation are illustrated in the chart below.
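As a rough illustration of how α and β can be estimated, the sketch below applies the usual least-squares formulas to a small data set. The data values and the function name `fit_simple_regression` are invented for this example and do not come from the guide.

```python
def fit_simple_regression(x, y):
    """Return (alpha, beta) for the line that minimises the sum of squared errors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations of X and Y, and of squared deviations of X
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    beta = s_xy / s_xx                # slope of the line of best fit
    alpha = mean_y - beta * mean_x    # the line passes through (mean_x, mean_y)
    return alpha, beta


# Hypothetical data, invented purely for illustration
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

alpha, beta = fit_simple_regression(x, y)
# Error terms: actual Y minus the value predicted by the fitted line
residuals = [yi - (alpha + beta * xi) for xi, yi in zip(x, y)]
print(f"alpha = {alpha:.3f}, beta = {beta:.3f}")
```

With an intercept in the model, the error terms of the fitted line sum to (approximately) zero, which is a quick check that the line has been fitted correctly.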
Regression coefficient
The most important parameter to test is β, the regression coefficient. As with confidence intervals and statistical significance, the starting point is hypothesis testing:
- Null hypothesis (H0): β = 0. There is no (linear) association between X and Y in the population
- Alternative hypothesis (Ha): β ≠ 0. There is some (linear) association between X and Y in the population
When a researcher rejects the null hypothesis, they are saying that the probability of observing a test statistic at least as large as the one actually observed, if the null hypothesis were true, is so small that they feel confident enough to reject it. The threshold for this probability (known as the significance level) is often 0.05 or 0.1, but can be as low as 0.01 (in medical trials).
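The test of β can be sketched as follows: the estimated slope is divided by its standard error to give a t-statistic, which is then compared with a critical value. This is a minimal sketch on invented data. The function name `slope_t_statistic` is hypothetical, and the critical value 2.306 is the tabulated two-sided 5% value for a t-distribution with 8 degrees of freedom (10 observations minus 2 estimated parameters).

```python
import math


def slope_t_statistic(x, y):
    """Return (beta, standard error of beta, t-statistic) for testing H0: beta = 0."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta = s_xy / s_xx
    alpha = mean_y - beta * mean_x
    # Sum of squared errors around the fitted line
    sse = sum((yi - (alpha + beta * xi)) ** 2 for xi, yi in zip(x, y))
    # Residual variance uses n - 2 degrees of freedom (alpha and beta are estimated)
    se_beta = math.sqrt(sse / (n - 2) / s_xx)
    return beta, se_beta, beta / se_beta


# Hypothetical data, invented purely for illustration
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [1.2, 2.9, 3.1, 4.8, 5.1, 6.9, 7.2, 8.8, 9.1, 10.9]

beta, se_beta, t_stat = slope_t_statistic(x, y)
T_CRITICAL = 2.306  # two-sided 5% critical value, t-distribution with 8 degrees of freedom
reject_null = abs(t_stat) > T_CRITICAL
print(f"t = {t_stat:.1f}, reject H0 at the 5% level: {reject_null}")
```

Statistical software normally reports an exact p-value rather than a comparison with a tabulated critical value; the comparison is used here only to keep the sketch free of external libraries.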
Correlation coefficient
‘The Pearson product moment correlation coefficient’ (or simply ‘the correlation coefficient’) is a measure of the strength of the association between the two variables. The correlation coefficient is denoted ‘R’ and can take values from –1 to +1:
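- A positive figure indicates a positive correlation (an upward-sloping line of best fit), and a negative figure indicates a negative correlation (a downward-sloping line of best fit).
- If R equals +1 or –1, all the points lie exactly on a straight line. If R equals 0, then there is a complete lack of linear correlation.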
In practice, such extreme results are unlikely, and the general rule is that values closer to +1 or –1 indicate a stronger association. Some examples are illustrated in the charts below.
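The extreme cases can be reproduced with a short calculation. The sketch below computes R directly from its definition for three invented data sets; the function name `pearson_r` and all data values are assumptions made for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson product moment correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_yy = sum((yi - mean_y) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical data sets, invented to show the extreme cases
r_up = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])             # points on a rising line
r_down = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])           # points on a falling line
r_curve = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])  # Y is X squared
print(r_up, r_down, r_curve)
```

The third data set gives an R of zero even though Y is completely determined by X, which is a reminder that R only measures linear association.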
Coefficient of determination
R², sometimes known as ‘the coefficient of determination’, measures the proportion of the variance in Y that is explained by X. Values closer to 1 indicate a closer association, meaning that X is better at predicting Y.
R² is sometimes given as the sole or most important measure of the association between X and Y, and hence the usefulness of the model. The interpretation of a particular value of R² is not purely statistical, and a high value does not necessarily mean that X causes Y or that it is a meaningful explanation of Y.
Associations in the social sciences tend to have smaller R² values than those in the physical sciences because they are dealing with human factors that often involve more unexplained variation. The R² value is not a measure of how well the regression model fits, and a useful model can have a low value.
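To make the idea of ‘variance explained’ concrete, the sketch below computes R² as one minus the ratio of unexplained to total variation. Both data sets are invented so that the fitted line is the same (a slope of 2) while the scatter around it differs; the function name `r_squared` is likewise hypothetical.

```python
def r_squared(x, y):
    """Proportion of the variation in y explained by a least-squares line on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta = s_xy / s_xx
    alpha = mean_y - beta * mean_x
    # Unexplained variation: squared distances from the fitted line
    sse = sum((yi - (alpha + beta * xi)) ** 2 for xi, yi in zip(x, y))
    # Total variation: squared distances from the mean of y
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - sse / sst


# Hypothetical data: same underlying line, different amounts of scatter
x = [1, 2, 3, 4, 5, 6]
y_tight = [2.1, 3.8, 6.1, 8.1, 9.8, 12.1]
y_noisy = [5, -2, 9, 11, 4, 15]

print(f"tight scatter: R squared = {r_squared(x, y_tight):.3f}")
print(f"wide scatter:  R squared = {r_squared(x, y_noisy):.3f}")
```

In both cases the estimated slope is the same, so the estimated effect of X on Y is identical; only the proportion of the variation in Y that X accounts for changes.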
Association and causation
In his classic essay on causation, Sir Austin Bradford Hill set out his views of the most important factors to consider when deciding whether an observed statistical association is due to causation. These are given in descending order of importance:
Strength: what increase in cases of the potential effect or outcome is observed when the potential cause is present? Strength here refers to differences in the instances of the effect or outcome, not to the statistical strength of the association. The association must in any case be statistically ‘significant’, and not down to chance, before a hypothesis of causation is considered.
Consistency: has the finding been repeatedly observed, by different people, at different times and under different circumstances?
Specificity: how specific is the potential effect? Is it limited to particular groups? Is the potential cause associated with other outcomes? A high degree of specificity can support the case for a causal hypothesis, but such clear, simple and distinct one-to-one relationships are rare.
Temporality: in what order did the events happen? An effect needs to come after a cause.
‘Biological gradient’: is the effect stronger where the potential cause is stronger (more intense, longer duration of exposure and so on), a so-called dose-response curve?
Plausibility: is there a plausible theory behind the hypothesis of causation?
Coherence: does the hypothesis make sense given current knowledge and related observations?
Experiment: is there any experimental evidence specifically connected to the hypothesis?
Analogy: are there any similar causal relationships?
Example: Primary Care Trust deficits and health allocation per head
You have obtained data for all Primary Care Trusts (PCTs). Data includes outturn as a proportion of PCT turnover and health allocation per head of population. How can you tell whether there is an association between the variables?
The scatter plot below suggests a possible positive relationship, which makes some intuitive sense, but the points do not appear to form anything like a straight line. We can therefore expect that only a small proportion of the variation in Y is explained by X. We cannot draw any firm conclusions without calculating and testing the model parameters.
The regression model formula is:
Yot = α + βXall + ε
Where:
- Yot is the PCT outturn expressed as surplus/deficit as a percentage of turnover
- Xall is the allocation per head
- α is the intercept (that is, the outturn level if funding per head is zero)
Null hypothesis (H0):
- β = 0 (there is no (linear) association between PCT outturn and allocation per head in the population)
Alternative hypothesis (Ha):
- β ≠ 0 (there is some (linear) association between PCT outturn and allocation per head in the population)
The model parameters and other regression statistics can be calculated in Excel (Tools; Add-ins; tick Analysis ToolPak; then back to Tools; Data Analysis; Regression).
Regression output format should be similar to this.
Interpreting the output
Regression coefficient
The coefficients column enables us to state the regression model. In this case:
- Yot = –0.066 + 0.00004Xall + ε
Each £1 increase in health allocation per head is estimated to result in a 0.00004 increase in PCT outturn as a proportion of turnover (0.004 percentage points).
Alternatively, a £100 increase in allocation per head is estimated to result in an increase in outturn of 0.4 percentage points of turnover. This line is illustrated in the next chart.
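The arithmetic behind these figures can be checked with a few lines of code. This sketch uses the coefficients quoted above and assumes, as the percentage-point figures imply, that outturn is measured as a proportion of turnover. The allocation values of £1,400 and £1,500 per head and the function name `predicted_outturn` are invented for illustration.

```python
ALPHA = -0.066   # intercept from the regression output
BETA = 0.00004   # estimated change in outturn (as a proportion of turnover) per £1 per head


def predicted_outturn(allocation_per_head):
    """Predicted outturn as a proportion of turnover (error term ignored)."""
    return ALPHA + BETA * allocation_per_head


# Hypothetical allocations per head, £100 apart
low, high = 1400, 1500
change = predicted_outturn(high) - predicted_outturn(low)
change_pp = change * 100  # convert a proportion to percentage points
print(f"Estimated change for an extra £100 per head: {change_pp:.1f} percentage points")
```

Because the model is linear, the estimated change is the same whichever two allocations £100 apart are chosen.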
The test of statistical significance is to calculate a t-statistic (given in the regression output above). The probability (p-value) of observing a t-statistic as high as 5.6 if the null hypothesis were true gives the answer to the hypothesis test:
- H0: β = 0; p < 0.001 (p = 0.0000005)
If there were no linear association, a result this extreme would be very unlikely, so we can reject the null hypothesis. There is therefore some linear association between PCT outturn as a percentage of turnover and health allocations per head.
The output shows a positive linear association between outturn and allocation per head among the population (what we expected before the analysis). As the p-value is so low, this relationship is significant at the 5% (or even 0.1%) level of significance. The 95% confidence interval for β is also given in the regression output: 0.000026 to 0.000055. Alternatively, the confidence interval for the effect of a £100 increase in allocation per head is 0.26 to 0.55 percentage points.
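The confidence interval can be rescaled in the same way as the coefficient itself, and it offers a second view of the hypothesis test: a 95% interval that excludes zero corresponds to rejecting β = 0 at the 5% level. The sketch below uses the interval quoted above; the variable names are assumptions for illustration.

```python
# 95% confidence interval for beta, as quoted from the regression output
CI_LOW, CI_HIGH = 0.000026, 0.000055

POUNDS = 100            # size of the increase in allocation per head
TO_PERCENT_POINTS = 100  # converts a proportion of turnover to percentage points

ci_low_pp = CI_LOW * POUNDS * TO_PERCENT_POINTS
ci_high_pp = CI_HIGH * POUNDS * TO_PERCENT_POINTS

# If zero lies outside the interval, beta = 0 is rejected at the 5% level
excludes_zero = not (CI_LOW <= 0 <= CI_HIGH)
print(f"£100 per head: {ci_low_pp:.2f} to {ci_high_pp:.2f} percentage points")
print(f"Interval excludes zero: {excludes_zero}")
```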
Coefficient of determination
In this case, R² equals 0.09, suggesting that 9% of the variation in PCT outturns as a proportion of turnover is explained by changes in allocation per head. This is a low degree of explanation and confirms the visual interpretation of a wide spread of results around the regression line. There is a general positive association, but allocations per head explain little of the variance in levels of deficit. The regression model therefore does not help to explain much of what is going on here, and is poor at predicting levels of deficit from allocations in its current form. This might imply that other explanatory variables could be usefully added to the model. Where more variables are added, this is called “multiple regression”.