Objectives
SL 4.4
Linear correlation of bivariate data.
Pearson’s product-moment correlation coefficient, r.
Scatter diagrams; lines of best fit, by eye, passing through the mean point.
Equation of the regression line of y on x.
Use of the equation of the regression line for prediction purposes.
Interpret the meaning of the parameters, a and b, in a linear regression y=ax+b.
Introduction
Understanding the concept of correlation is essential in the IB course. Correlation is a statistical measure that describes the degree to which two variables move in relation to each other.
Correlation measures the strength and direction of the relationship between two variables. It does not establish causation: two variables can move together without one causing the other to change. This is an important point to remember.
Types of Correlation
There are three types of correlation:
Positive correlation - both variables increase together (hours studied vs. exam scores).
Negative correlation - one variable increases while the other decreases (outdoor temperature vs. heating bills).
Zero correlation - no observable relationship (shoe size vs. grades).
Watch Revision Village Correlation between two variables video.
Line of Best Fit (Least Squares Regression Line)
This should be revision for you, so I won't spend too much time on it. There is a line y = ax + b that best fits our data, known as the least squares regression line. You may have called it the line of best fit in previous studies.
Watch Minity Maths Line of Best Fit.
Work through Khan Academy Statistics and Probability Unit 5 Introduction to Trend Lines and Quiz 3.
Thankfully, you can do all this on your GDC! Watch CasioEducations Creating a Table and Finding Regression.
Once we have the regression line, it is possible to predict y values for x values. Estimating y-values within the range of data is known as interpolation. Estimating y-values outside the range is known as extrapolation. Interpolations are generally more reliable than extrapolations.
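To see what the GDC is doing behind the scenes, here is a minimal Python sketch of the regression line and of the difference between interpolation and extrapolation. The data set (hours studied and exam scores) and the function names are made up for illustration; they are not taken from any of the worksheets.

```python
from statistics import mean

def regression_line(xs, ys):
    """Return (a, b) for the least squares regression line y = a*x + b."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    a = s_xy / s_xx            # gradient: change in y per unit increase in x
    b = y_bar - a * x_bar      # intercept: chosen so the line passes through the mean point
    return a, b

def predict(x, a, b, xs):
    """Return the predicted y-value and whether it is an interpolation or an extrapolation."""
    kind = "interpolation" if min(xs) <= x <= max(xs) else "extrapolation"
    return a * x + b, kind

# Hypothetical data: hours studied (x) and exam score (y)
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 61, 70, 74]

a, b = regression_line(hours, scores)
print(f"y = {a:.2f}x + {b:.2f}")
print(predict(3.5, a, b, hours))   # inside the data range
print(predict(8, a, b, hours))     # outside the data range, so treat with caution
```

In this sketch, a is the predicted increase in score for each extra hour studied and b is the predicted score for zero hours of study, which is how the parameters in y = ax + b are interpreted.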
Practice
Complete the worksheet Scatter Graphs on Corbett Maths. Answer each question and also use your GDC to find the regression line.
Thinking Point
Why is it important to understand correlation in real-world contexts?
Can you think of examples where correlation might be misinterpreted?
Spend a few minutes thinking about these questions and briefly answer them in your journal.
If you need help thinking through possible discussion points, you can use the prompt sheet. I encourage you to have a go at researching these questions and trying to develop an argument yourself before looking at the prompts.
Pearson's Correlation Coefficient (r)
This is the most commonly used method of calculating correlation. Pearson's correlation coefficient (r) is a numerical measure of the strength and direction of a linear relationship between two variables. Values range from −1 (perfect negative correlation) to 1 (perfect positive correlation), with 0 indicating no linear correlation.
This can only be used for linear data. Linear data is data whose scatter diagram shows the points lying roughly along a straight line.
Pearson's correlation coefficient (r) indicates the degree of linear relationship between two variables.
When determining if a correlation is strong, moderate or weak, we can use the r value.
Strength categories:
Strong - 0.8 ≤ |r| ≤ 1
Moderate - 0.5 ≤ |r| < 0.8
Weak - 0 ≤ |r| < 0.5
The direction of correlation:
Positive r = positive correlation
Negative r = negative correlation
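The strength categories and direction rules above can be sketched as a short Python function. The function name describe_correlation is made up for this example, and the cut-off values are the ones listed in these notes; other textbooks may use slightly different boundaries.

```python
def describe_correlation(r):
    """Describe a Pearson's r value using the strength bands from these notes."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and 1")
    if r == 0:
        return "no linear correlation"

    size = abs(r)                      # strength depends only on |r|
    if size >= 0.8:
        strength = "strong"
    elif size >= 0.5:
        strength = "moderate"
    else:
        strength = "weak"

    direction = "positive" if r > 0 else "negative"   # the sign of r gives the direction
    return f"{strength} {direction}"

print(describe_correlation(0.93))    # strong positive
print(describe_correlation(-0.62))   # moderate negative
print(describe_correlation(0.2))     # weak positive
```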
Watch Mathslessons How to find r and line of best fit.
Formula
You will NOT be expected to calculate the Pearson's Correlation Coefficient manually. You will need to be able to use your GDC to do so. However, knowing how the formula works is helpful.
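One standard way of writing the formula is shown here. The notation is an assumption on my part and may differ slightly from the version in your textbook or formula booklet: x̄ and ȳ are the means of the x and y values, and n is the number of data points.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

Each term in the numerator is positive when a point is above (or below) the mean in both x and y, and negative when it is above the mean in one variable and below it in the other. This is why points that rise together give a positive r and points that move in opposite directions give a negative r. The denominator scales the result so that it always lies between −1 and 1.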
Practice
Complete Khan Academy Statistics and Probability Unit 5 Correlation coefficients and Quiz 1.
Complete section 12.3 in Unit 12 of Statistics in the MEP book.
Spearman's Rank Correlation Coefficient
When the data is not linear, or is ordinal, Spearman's rank correlation is used. It assesses how well the relationship between two variables can be described using a monotonic function. The formula is:
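A commonly quoted form of Spearman's formula is given here. It is exact only when there are no tied ranks, and the symbols are my labelling rather than something fixed by these notes: d_i is the difference between the two ranks of the i-th data pair, and n is the number of pairs.

```latex
r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}
```

If every pair has the same rank in both variables, every d_i is 0 and r_s = 1. When there are tied ranks, calculating Pearson's r on the ranks themselves is the safer route.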
Watch OSC Spearman's rank coefficient video.
Spearman's rank coefficient is much less sensitive to outliers than Pearson's. So if, after analysing the data, you decide that you should keep an outlier, using Spearman's will give a more reliable result.
When ranking variables you may come across tied ranks: for example, three students might tie for third place. In this case you average the ranks they would otherwise occupy. The three students would occupy ranks 3, 4 and 5, and (3 + 4 + 5) / 3 = 4, so all three students are given rank 4.
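The tied-rank rule can be checked with a short Python sketch. The scores and function names below are made up for illustration. The sketch ranks the largest value as 1 and calculates Spearman's coefficient as Pearson's r applied to the ranks, which is one standard way of handling ties.

```python
from math import sqrt

def average_ranks(values):
    """Rank values from largest (rank 1) to smallest, giving tied values their average rank."""
    ordered = sorted(values, reverse=True)
    ranks = {}
    for value in set(values):
        first = ordered.index(value) + 1         # first position this value occupies
        count = ordered.count(value)             # how many data points share this value
        ranks[value] = first + (count - 1) / 2   # mean of the positions occupied
    return [ranks[v] for v in values]

def pearson_r(xs, ys):
    """Pearson's product-moment correlation coefficient."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

def spearman_r(xs, ys):
    """Spearman's rank correlation: Pearson's r calculated on the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

# Hypothetical test scores: three students tie for third place
scores = [95, 90, 84, 84, 84, 70]
print(average_ranks(scores))   # [1.0, 2.0, 4.0, 4.0, 4.0, 6.0]
```

The three tied students would have taken ranks 3, 4 and 5, so each is given rank 4, and the next student is ranked 6.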
Complete the MEP worksheet.
Complete section 12.4 in Unit 12 of Statistics in the MEP book.
Thinking Point
Watch the TEDed video How Statistics can be Misleading.
And Tedx The Dangers of Mixing up Correlation and Causation.
Think about the following questions and write a response in your journal.
How are unethical practices, such as “data dredging,” used by statisticians to deliberately manipulate and mislead people?
What steps can we take to help ourselves avoid being misled by statistics used in unclear or disingenuous ways in the media?
If you need help thinking through possible discussion points, you can use the prompt sheet. I encourage you to have a go at researching these questions and trying to develop an argument yourself before looking at the prompts.