I am a full-time head statistician working in the public sector, with around ten years’ experience in both public and private sector positions. Previously I worked as an analyst in both research and information roles. I have a master’s degree in Economics and a bachelor’s degree in Business Economics. In statistics, I am comfortable with all aspects of theory and technique, using the statistical package SPSS to analyse, manipulate and draw conclusions from large data sets. My current role also involves coaching and lecturing analysts on statistical techniques to aid their job development. In economics, I have a broad understanding of economic theory to master’s degree level, especially microeconomic, macroeconomic and historical/political theory.
The basic theory of correlation and how to calculate it manually.
The aim of this paper is to explain the basic theory of correlation and how it is calculated manually from a data set. The purpose is to give the reader a sound understanding of the subject, so that they are confident enough to calculate the correlation figure from any data set if and when required to aid the organisation.
It is often beneficial for analysts to know what kind of relationship exists, if any, between two or more variables (Field, 2001). For example, it might be interesting to discover whether there is a relationship between the amount of time spent reading this paper (variable 1) and the reader’s understanding of correlation theory (variable 2).
The relationship between these two variables can take one of three forms.
- Positively related (+): The more time spent reading this paper, the greater the reader’s understanding of correlation theory.
- Negatively related (-): The more time spent reading this paper, the worse the reader’s understanding of correlation theory becomes.
- Zero correlation (0): The reader’s understanding of correlation theory remains the same regardless of how much time is spent reading this paper.
The correlation coefficient, represented as r, tells us which of these three forms the relationship between our variables takes. It expresses the linear relationship between two variables numerically, as a value between -1 and +1. The correlation coefficient is calculated from the covariance via a process called standardisation. The whole method is explained step by step throughout the paper.
1.1 How to Measure Relationships
To begin with, it is important to remember that the variance measures, on average, how much the data vary from the mean (strictly, it is the average squared deviation). If two variables are related, we would expect them to vary together: when one variable deviates from its mean, the other should deviate from its mean in a similar way, whether that is in a positive or a negative direction (Bissell, 1994).
To illustrate this, the table below shows a fictional sample of five athletes who were each exposed to a certain number of adverts promoting a high-energy sports drink; the number of drinks each athlete purchased over the following weeks was then recorded. The mean and standard deviation of each variable are then calculated¹ (see table 1.0 below).
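As a minimal sketch, the summary figures for this example can be reproduced in Python using only the standard library. The data values are the ones listed in Appendix A; the variable names are illustrative assumptions, not part of the original tables.

```python
from statistics import mean, stdev

# Example data for the five athletes (values as listed in Appendix A).
adverts = [5, 4, 4, 6, 8]
drinks = [8, 9, 10, 13, 15]

# stdev() is the sample standard deviation (it divides by n - 1),
# which matches the manual method described in the footnote.
mean_adverts = mean(adverts)
mean_drinks = mean(drinks)
sd_adverts = stdev(adverts)
sd_drinks = stdev(drinks)

print(f"Adverts: mean = {mean_adverts:.2f}, SD = {sd_adverts:.2f}")
print(f"Drinks:  mean = {mean_drinks:.2f}, SD = {sd_drinks:.2f}")
```

The printed values should agree with the means (5.4 and 11) and standard deviations (1.67 and 2.92) used throughout the paper.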
1.2 Calculating the Covariance
The shared variance between any two variables is called the covariance, and it is the first step in calculating the correlation coefficient. From the data in table 1.1, using subject 1 as an example, we calculate how far the adverts and purchasing values deviate from their respective means.
Adverts = 5 – 5.4 = – 0.4
Drinks Purchased = 8 – 11 = – 3
The two results are multiplied together; the product is known as the cross-product deviation.
-0.4 * -3 = 1.2
The cross-product deviations for the rest of the subjects are calculated using the same method; the results are shown below in table 1.2.
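The same step can be sketched for all five subjects at once. This is an illustrative snippet rather than part of the original method; the data values are those listed in Appendix A and the variable names are assumptions.

```python
# Example data for the five athletes (values as listed in Appendix A).
adverts = [5, 4, 4, 6, 8]
drinks = [8, 9, 10, 13, 15]

mean_adverts = sum(adverts) / len(adverts)  # 5.4
mean_drinks = sum(drinks) / len(drinks)     # 11.0

# One cross-product deviation per subject:
# (adverts - mean adverts) * (drinks - mean drinks)
cross_products = [
    (a - mean_adverts) * (d - mean_drinks)
    for a, d in zip(adverts, drinks)
]

for subject, cp in enumerate(cross_products, start=1):
    print(f"Subject {subject}: {cp:.1f}")
```

Rounded to one decimal place, the five values are 1.2, 2.8, 1.4, 1.2 and 10.4.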
The second step is to find the average value of the combined differences of the two variables. This is achieved by summing the cross-product deviations and dividing by the number of observations (subjects) minus 1. The result is known as the covariance.
(1.2 + 2.8 + 1.4 + 1.2 + 10.4) / (5 - 1) = 17 / 4 = 4.25
With the covariance now calculated, we have a good measure with which to assess the direction of the relationship between the two variables. The result, 4.25, is a positive number, which tells us that as one variable deviates from its mean, the other variable deviates in the same direction. A negative covariance, by contrast, would suggest that as one variable deviates from its mean, the other deviates in the opposite direction (as one increases, the other decreases).
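The covariance calculation can be written as a short function. This is a minimal sketch under the same assumptions as the worked example (sample covariance, dividing by the number of observations minus 1); the function name `sample_covariance` is hypothetical.

```python
def sample_covariance(x, y):
    """Sum of cross-product deviations divided by (n - 1)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    total = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return total / (len(x) - 1)


# Example data for the five athletes (values as listed in Appendix A).
adverts = [5, 4, 4, 6, 8]
drinks = [8, 9, 10, 13, 15]

cov = sample_covariance(adverts, drinks)
print(f"Covariance = {cov:.2f}")
```

A positive result indicates that the two variables tend to deviate from their means in the same direction.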
1.3 Standardisation and Calculating the Correlation Coefficient (r)
The covariance is helpful for understanding the direction of a relationship, but it is fundamentally flawed in that it depends upon the scale of measurement used and is therefore not a standardised measure. The importance of having standardised values is perhaps explained most clearly by Field (2001). Field gives the example of two variables measured in miles with a covariance of 4.25. If the data are converted to kilometres (by multiplying each value by 1.609) and the covariance is calculated again, it increases to about 11, because each cross-product deviation is multiplied by 1.609 twice (1.609 * 1.609 ≈ 2.59). This dependence on the scale of measurement is a problem because it means we cannot compare covariances objectively: we cannot say whether a covariance is particularly large or small relative to another data set unless both data sets were measured in the same units. The take-home point is that standardisation allows you to compare effect sizes across different studies that have been measured on different scales.
The scale-of-measurement problem is resolved by standardising the covariance. This means converting the covariance into a standard set of units. To achieve this we need a unit of measurement into which any scale of measurement can be converted. The unit of measurement we use is the standard deviation.
The standard deviation is a measure of the average deviation from the mean. If we divide any distance from the mean by the standard deviation, we get that distance in standard deviation units. For example, from the data in table 1.1, the standard deviation for drinks purchased is 2.92 (roughly 3), and the observed distance from the mean for subject 1 is -3 (8 - 11). In other words, the deviation (or error) for subject 1 was -3 drinks. If we divide this distance by the standard deviation we get approximately -1 (-3 / 2.92 = -1.03). So it is accurate to state that subject 1’s score was about 1 standard deviation below the mean.
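The scale dependence can be checked numerically. The sketch below is illustrative only: it reuses the example numbers and simply treats them as hypothetical distances in miles (an assumption made for the demonstration, not something stated in the source), then converts both variables to kilometres.

```python
MILES_TO_KM = 1.609


def sample_covariance(x, y):
    """Sum of cross-product deviations divided by (n - 1)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    total = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return total / (len(x) - 1)


# Hypothetical values, treated as distances in miles for illustration only.
x_miles = [5, 4, 4, 6, 8]
y_miles = [8, 9, 10, 13, 15]

# Converting both variables multiplies every deviation by 1.609,
# so the covariance is multiplied by 1.609 squared.
x_km = [v * MILES_TO_KM for v in x_miles]
y_km = [v * MILES_TO_KM for v in y_miles]

cov_miles = sample_covariance(x_miles, y_miles)
cov_km = sample_covariance(x_km, y_km)

print(f"Covariance in miles:      {cov_miles:.2f}")
print(f"Covariance in kilometres: {cov_km:.2f}")
```

The relationship between the variables has not changed, yet the covariance rises from 4.25 to roughly 11 purely because of the change of units.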
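As a small illustrative sketch, the conversion of subject 1’s deviation into standard deviation units looks like this in Python. The data values are those listed in Appendix A; the variable names are assumptions.

```python
from statistics import mean, stdev

# Drinks purchased by the five athletes (values as listed in Appendix A).
drinks = [8, 9, 10, 13, 15]

# Subject 1 is the first entry in the list.
deviation = drinks[0] - mean(drinks)   # distance from the mean, in drinks
z_score = deviation / stdev(drinks)    # same distance, in standard deviations

print(f"Deviation = {deviation}, in SD units = {z_score:.2f}")
```

Using the unrounded standard deviation gives a value slightly beyond -1, which is why the text describes the result as approximately -1.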
¹ The mean is calculated by adding all the ‘adverts’ data in table 1.0 together and dividing by the count of ‘adverts’ (and likewise for ‘drinks purchased’). For the standard deviation, subtract the mean from each data point, square the results, add them together, divide by the number of observations minus 1, then take the square root. (See Appendix A for more detail.)
The same process is used to express the covariance in standard units. However, because we are dealing with two variables there are two standard deviations, so we multiply them together. We then take the covariance calculated earlier and divide it by the product of the two standard deviations. The result is the standardised covariance, known as the correlation coefficient, which, as stated earlier, is always represented by the letter r.
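Written as formulas, the two steps described so far are as follows. The symbols are notation chosen here for illustration: x and y are the two variables, x̄ and ȳ their means, s_x and s_y their sample standard deviations, and N the number of observations.

```latex
\[
\operatorname{cov}(x, y) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N - 1},
\qquad
r = \frac{\operatorname{cov}(x, y)}{s_x \, s_y}
\]
```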
Using the example data in table 1.1, first multiply both standard deviations together:
1.67 * 2.92 = 4.8764
Then divide the covariance by this product:
4.25 / 4.8764 = 0.87
r = 0.87
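The whole calculation can be sketched as one function. This is a minimal illustration of the manual method, not a replacement for a statistics package; the function name `pearson_r` is hypothetical and the data values are those listed in Appendix A.

```python
from math import sqrt


def pearson_r(x, y):
    """Covariance divided by the product of the two sample standard deviations."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("r is undefined when a variable has zero variance")
    return covariance / (sd_x * sd_y)


# Example data for the five athletes (values as listed in Appendix A).
adverts = [5, 4, 4, 6, 8]
drinks = [8, 9, 10, 13, 15]

r = pearson_r(adverts, drinks)
print(f"r = {r:.2f}")
```

Because the function works with unrounded standard deviations, its result can differ from the hand calculation in the third decimal place, but both round to 0.87.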
The resulting figure is the correlation coefficient and represents your effect size. By standardising the covariance we obtain a score between -1 and +1. A perfect positive correlation is +1, a perfect negative correlation is -1, and 0 indicates no linear relationship, meaning that a change in one variable is not accompanied by any consistent change in the other. It is widely accepted that r = ±0.10 is a small effect, ±0.30 a medium effect and ±0.5 or above a large effect (Howell, 2002).
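These thresholds can be expressed as a small helper. The sketch below is illustrative: the function name `effect_size_label` is hypothetical, and the label "negligible" for values below 0.10 is an assumption added here, since the text does not name that range.

```python
def effect_size_label(r):
    """Label a correlation coefficient using the thresholds quoted above."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and +1")
    size = abs(r)  # the sign gives direction only, not strength
    if size >= 0.5:
        return "large"
    if size >= 0.3:
        return "medium"
    if size >= 0.1:
        return "small"
    return "negligible"  # assumed label; not defined in the text


print(effect_size_label(0.87))
```

For the sports drink example (r = 0.87) this returns "large".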
1.4 Importance of Causality
With a correlation coefficient (r) of .87, we can state that there is a large effect: advertising and the purchasing of sports drinks by athletes are strongly related. However, when quoting the correlation coefficient we must give some consideration to causality. The r value gives no indication of the direction of causality. In the sports drink example we cannot say that sports drink advertising causes athletes to purchase the drinks, only that the two are strongly associated. This is because of the third variable problem. In any relationship between two variables, causality cannot be assumed because there may be other measured or unmeasured variables affecting the result (Gould, 1981). In our example, the purchasing of sports drinks could be explained equally well by other unmeasured variables, such as usage by peers or advice from coaches, to name a few.
To get a better grasp of the importance of the third variable problem, consider the example provided by Field and Hole (2003), who cite Broca’s finding of a strong correlation between gender and brain size, with women having smaller brains on the whole than men. At the time this study was used to argue that women were intellectually inferior to men. However, the initial research took absolutely no account of body size (the third variable): people with bigger bodies have bigger brains irrespective of intellectual ability. So the findings were not as strong as suggested. This was further demonstrated when Gould (1981) re-analysed the original data and showed that the strong correlation between gender and brain size completely disappeared once body size was accounted for.
Hopefully this example highlights the issue of causality. When quoting the correlation coefficient, make it clear that it indicates the effect size only and says nothing about causality.
1.5 Using R² for Interpretation
Aside from causality, we can still use the correlation coefficient to give us more information by simply squaring it. The squared r value is known as the coefficient of determination, R². This is useful because it shows the proportion of the variance (fluctuation) of one variable that is predictable from the other variable. So, again using our example, the number of drinks purchased will vary from athlete to athlete because of a number of different factors (taste, for example). If all of this variability were added up, it would provide an estimate of the total variability that exists.
Our correlation coefficient was 0.87, so
R² = 0.87 * 0.87 = 0.7569
By multiplying by 100 we can state that advertising accounts for 75.7% of the variability in drinks purchasing, leaving only 24.3% of the variability to be accounted for by other variables. This is a very high percentage and further indicates the strong relationship between the two variables.
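The arithmetic can be sketched in a few lines. The variable names are illustrative assumptions, and the rounded value r = 0.87 is used so that the figures match the hand calculation.

```python
r = 0.87  # correlation coefficient from the worked example (rounded)

r_squared = r ** 2                       # coefficient of determination
explained_pct = r_squared * 100          # variability shared with the other variable
unexplained_pct = 100 - explained_pct    # variability left to other factors

print(f"R squared   = {r_squared:.4f}")
print(f"Explained   = {explained_pct:.1f}%")
print(f"Unexplained = {unexplained_pct:.1f}%")
```

As with r itself, these percentages describe shared variability and do not by themselves establish that one variable causes the other.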
In summary, correlation is a measure of the linear relationship between variables. Initially we looked at the shared variance, which is called the covariance and can be described as a crude measure of the relationship between variables. Through the standardisation process the correlation coefficient, expressed as r, can be calculated from the covariance. The correlation coefficient has to lie between -1 and +1; any other value means a miscalculation has been made at an earlier stage. A coefficient of +1 indicates a perfect positive relationship, while a coefficient of -1 indicates a perfect negative relationship. A correlation coefficient of 0 indicates no linear relationship at all. Remember that the correlation coefficient (r) represents your effect size: ±0.10 is considered a small effect, ±0.30 a medium effect and ±0.5 or above a large effect. The coefficient can be taken a step further by squaring it, which gives the proportion of variability in one variable that can be explained by the other (multiply by 100 for a percentage). Finally, remember to mention causality to complement your results.
Bissell, D. (1994). Statistical Methods for SPC & TQM. London: Chapman and Hall.
Field, A. (2001). Discovering Statistics Using SPSS (2nd edition). London: Sage.
Howell, D.C. (2002). Statistical Methods for Psychology (5th edition). Belmont, CA.
Gould, S.J. (1981). The Mismeasure of Man. London: Penguin.
Field, A. & Hole, G.J. (2003). How to Design and Report Experiments. London: Sage.
Appendix A – Step by step guide to manually calculate the Standard Deviation
(1) Subtract the respective mean from each individual data point.
Adverts                   Drinks Purchased
5 – 5.4 = -0.4            8 – 11 = -3
4 – 5.4 = -1.4            9 – 11 = -2
4 – 5.4 = -1.4            10 – 11 = -1
6 – 5.4 = 0.6             13 – 11 = 2
8 – 5.4 = 2.6             15 – 11 = 4
(2) Square each result and sum.
Adverts                   Drinks Purchased
-0.4 * -0.4 = 0.16        -3 * -3 = 9
-1.4 * -1.4 = 1.96        -2 * -2 = 4
-1.4 * -1.4 = 1.96        -1 * -1 = 1
0.6 * 0.6 = 0.36          2 * 2 = 4
2.6 * 2.6 = 6.76          4 * 4 = 16
Sum = 11.2                Sum = 34
(3) Divide each sum by the number of observations minus 1 (5 – 1 = 4) to get the variance.
Adverts                   Drinks Purchased
11.2 / 4 = 2.8            34 / 4 = 8.5
(4) The standard deviation is the square root of the variance.
Adverts                   Drinks Purchased
√ 2.8 = 1.67              √ 8.5 = 2.92
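The four steps above can be sketched as a single function that returns the intermediate figures as well as the final result. This is an illustrative sketch; the function name `sample_sd_steps` is hypothetical.

```python
from math import sqrt


def sample_sd_steps(values):
    """Return (sum of squares, variance, standard deviation) for a sample."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are required")
    mean = sum(values) / n
    deviations = [v - mean for v in values]         # step 1: subtract the mean
    sum_squares = sum(d * d for d in deviations)    # step 2: square and sum
    variance = sum_squares / (n - 1)                # step 3: divide by n - 1
    return sum_squares, variance, sqrt(variance)    # step 4: square root


adverts = [5, 4, 4, 6, 8]
drinks = [8, 9, 10, 13, 15]

for name, data in (("Adverts", adverts), ("Drinks Purchased", drinks)):
    ss, var, sd = sample_sd_steps(data)
    print(f"{name}: sum of squares = {ss:.1f}, variance = {var:.1f}, SD = {sd:.2f}")
```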