Reading Summary: Correlation and Regression (Munro, 2005)
Correlation is a statistical method used to determine whether two or more variables are related. The correlation between only two variables is called a zero-order correlation, whereas multiple (higher-order) correlation refers to the relationship between two or more independent variables and one dependent variable. The Pearson product-moment correlation coefficient (r) is used to measure the strength of the relationship between variables.
Four assumptions underlie the use of r: the sample must be representative of the population; the variables must be normally distributed; the distribution of the scores must have approximately equal variability (homoscedasticity); and there must be a linear relationship between the variables.
The value of r may range from +1.00 through 0.00 to -1.00. A +1.00 indicates a perfect positive relationship; 0.00 indicates no relationship; and -1.00 indicates a perfect negative relationship. A relationship with an r value from 0.00 to 0.25 is described as little, if any; 0.26 to 0.49 as low; 0.50 to 0.69 as moderate; 0.70 to 0.89 as high; and 0.90 to 1.00 as very high. That is, the greater the magnitude of r, the stronger the relationship, and vice versa.
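A minimal numerical sketch (Python with NumPy; the paired scores are invented for illustration) of how r is computed from deviation scores:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient between two variables."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()  # deviation scores
    return np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

# Hypothetical paired scores on variables X and Y
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)  # r ≈ 0.775, a "high" correlation by the labels above
```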
The correlation coefficient also indicates the type of relationship that exists between the variables. A positive r indicates a positive relationship (i.e. the variables increase or decrease together); a negative r indicates a negative relationship (i.e. the variables are inversely related); and a zero r indicates no correlation (i.e. no pattern is discernible between the variables).
The probability that r occurred by chance alone (i.e. its significance) must be determined before r can be used to draw conclusions about the entire population. When low scores on one variable (X) are related to low scores on another variable (Y), but high scores on X are also related to low scores on Y, the relationship is described as curvilinear. Curvilinear relationships can be tested with eta, but not with r.
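The significance of r is commonly tested by converting it to a t statistic with n − 2 degrees of freedom; a brief sketch (the values of r and n here are illustrative, not from Munro):

```python
import math

def t_for_r(r, n):
    """t statistic for testing H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r**2))

# Hypothetical: r = 0.7746 obtained from only n = 5 pairs of scores
t = t_for_r(0.7746, 5)
# Compare |t| with the critical t for n - 2 = 3 df (3.182 at alpha = .05,
# two-tailed): here |t| ≈ 2.12 < 3.182, so this r is not significant
# despite its "high" magnitude, because the sample is so small.
```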
Since the relationship between two or more variables can be influenced by extraneous factors, the partial correlation technique is often used to control these influences on all the variables involved. When the influence of extraneous factors is removed from only one of the variables, the technique is called semi-partial correlation.
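Partial and semi-partial correlation can be sketched by correlating residuals, i.e. what is left of each variable after removing the linear influence of the extraneous variable Z (variable names and the approach via residuals are illustrative):

```python
import numpy as np

def residuals(y, z):
    """What remains of y after removing the linear influence of z."""
    z = np.asarray(z, dtype=float)
    y = np.asarray(y, dtype=float)
    b = np.sum((z - z.mean()) * (y - y.mean())) / np.sum((z - z.mean())**2)
    a = y.mean() - b * z.mean()
    return y - (a + b * z)

def partial_r(x, y, z):
    """Partial correlation: Z is partialled out of both X and Y."""
    return np.corrcoef(residuals(x, z), residuals(y, z))[0, 1]

def semipartial_r(x, y, z):
    """Semi-partial correlation: Z is removed from X only."""
    return np.corrcoef(residuals(x, z), np.asarray(y, dtype=float))[0, 1]
```

The residual approach gives the same answers as the standard closed-form formulas for first-order partial and semi-partial correlations.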
Regression is used to predict the score on one variable given the score on the other variable(s), provided there is a significant correlation between the variables. The higher the magnitude of the correlation coefficient, the more accurate the prediction. Simple regression is used when the relationship involves only two variables, whereas multiple regression is applied when more than two variables are correlated.
The equation of the simple regression line is the equation for a straight line: Y′ = a + bX, where Y′ is the predicted score, a is the intercept (the value of Y when X = 0), and b is the regression coefficient (slope). Converting the raw scores to standardized z-scores simplifies the regression equation to Y′ = rX, where Y′ and X are now z-scores and r is the correlation coefficient. When pairs of scores are plotted on a scatter diagram, the regression line passes through the centre of the data pairs, and hence is referred to as the “line of best fit”. The closer the value of r is to +1 or -1, the better the points fit the line.
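A sketch of fitting the line of best fit by least squares, and of the equivalence between the raw-score and z-score forms of the equation (the data are invented):

```python
import numpy as np

def fit_line(x, y):
    """Least-squares estimates of a (intercept) and b (slope) in Y' = a + bX."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
    a = y.mean() - b * x.mean()
    return a, b

# Hypothetical paired scores
x = np.array([1., 2, 3, 4, 5])
y = np.array([2., 4, 5, 4, 5])
a, b = fit_line(x, y)          # a = 2.2, b = 0.6
y_pred = a + b * x             # predicted scores on the line of best fit

# In z-score form the same prediction is simply z_y' = r * z_x
r = np.corrcoef(x, y)[0, 1]
z_x = (x - x.mean()) / x.std()
z_y_pred = r * z_x             # rescales back to exactly y_pred
```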
Different types of variation are associated with the regression equation. The total variation is the sum of the squared vertical distances of each point from the mean of Y. Total variation is the sum of the variation attributable to the relationship between the data pairs (the explained variation) and the variation due to chance (the unexplained variation). The ratio of the explained variation to the total variation is known as the coefficient of determination (R²).
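These variations can be computed directly from the fitted line; a numerical sketch with hypothetical data:

```python
import numpy as np

x = np.array([1., 2, 3, 4, 5])
y = np.array([2., 4, 5, 4, 5])

# Fit the simple regression line Y' = a + bX by least squares
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
a = y.mean() - b * x.mean()
y_pred = a + b * x

total = np.sum((y - y.mean()) ** 2)           # total variation
explained = np.sum((y_pred - y.mean()) ** 2)  # explained variation
unexplained = np.sum((y - y_pred) ** 2)       # variation due to chance

R2 = explained / total  # coefficient of determination
# total = explained + unexplained, and in simple regression R2 = r squared
```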
In simple linear regression, the correlation between the two variables must be significant before accurate predictions can be made. In multiple regression, the multiple correlation coefficient, the amount of variance accounted for (R²), and the individual regression coefficients are all tested for significance to determine whether the independent variables account for significant variance in the dependent variable. The significance of the overall multiple regression can be tested using the F-distribution, while the significance of each individual regression coefficient can be tested using either the F- or t-distribution.
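The overall F statistic for R² can be sketched as the ratio of explained to unexplained variance, each divided by its degrees of freedom (k predictors, n cases; the numbers below are illustrative, not from Munro):

```python
def overall_F(R2, n, k):
    """F statistic (df = k, n - k - 1) for H0: all k regression coefficients are 0."""
    return (R2 / k) / ((1 - R2) / (n - k - 1))

# Hypothetical multiple regression: R2 = .40 with k = 3 predictors, n = 50 cases
F = overall_F(0.40, 50, 3)  # compare with the critical F at (3, 46) df
```

In simple regression (k = 1) this F statistic is exactly the square of the t statistic for the regression coefficient, which is why either distribution can be used for individual coefficients.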
Categorical variables and interaction terms can be entered into the regression equation by means of coding without affecting the amount of variance accounted for or its significance.
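A sketch of one common coding scheme, dummy (0/1) coding, for entering a categorical variable into the design matrix; the group names and scores are invented:

```python
import numpy as np

# Hypothetical three-group categorical predictor and outcome scores
groups = np.array(["ctrl", "drugA", "drugB", "ctrl", "drugA", "drugB"])
y = np.array([1., 3, 5, 2, 4, 6])

# Two dummy columns code the three groups; "ctrl" is the reference category
d1 = (groups == "drugA").astype(float)
d2 = (groups == "drugB").astype(float)
X = np.column_stack([np.ones(len(y)), d1, d2])

# Least-squares fit: the intercept is the reference-group mean, and each
# dummy coefficient is that group's mean minus the reference-group mean
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
```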
Munro, B. H. (2005). Statistical Methods for Health Care Research (5th ed.). Philadelphia: Lippincott Williams & Wilkins.