Copyright: © Perspectives in Clinical Research

This is an open access article distributed under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License, which allows others to remix, tweak, and build upon the work non-commercially, as long as the author is credited and the new creations are licensed under the identical terms.

Correlation is a statistical technique that shows whether and how strongly two continuous variables are related. In this article, the eighth part in a series on ‘Common pitfalls in Statistical Analysis’, we look at the interpretation of the correlation coefficient and examine various situations in which the use of correlation may be inappropriate.

Keywords: Biostatistics, correlation, “data interpretation, statistical”

We often have information on two numeric characteristics for each member of a group and are interested in finding the degree of association between these characteristics. For instance, an obstetrician may decide to look up the records of women who delivered in her hospital in the previous year to find out whether there is a relationship between their family incomes and the birth weights of their babies. Relationship here means whether the two variables fluctuate together, i.e., whether the birth weight increases (or decreases) as the income increases.

“Correlation” is a statistical tool used to assess the degree of association of two quantitative variables measured in each member of a group. Although it is very commonly used in medical literature, it is also often misunderstood. This piece describes what “correlation” implies and the situations in which it may be used, as well as its pitfalls and the situations where it should not be used. To illustrate various concepts, we use scatter plots, a graphical method of showing values of two variables for each individual in a group.
The degree of correlation between any two variables on a continuous scale is mathematically expressed as the correlation coefficient (also known as Pearson's correlation coefficient or “r”), a number whose values can vary between −1.0 and +1.0. Thus, it has a sign (+ or −) and a magnitude. Two variables are said to be “positively” correlated [Figure 1a-c] when their values change in tandem, i.e., increasing values of one are associated with increasing values of the other. By contrast, a “negative” correlation [Figure 1d-f] exists when increasing values of one variable are associated with a decrease in the values of the other. Variables with no or little discernible relationship [Figure 1g] are said to have “no correlation.”

Figure 1: Scatter plots of relationship between values of two quantitative variables and their corresponding correlation coefficient (r) values. “r” can vary between −1.0 and +1.0. If as the values of one variable (say on X-axis) increase, those of the other variable (on Y-axis) increase, “r” is positive (a-c); however, if the latter decrease, “r” is negative (d-f). When the values of two variables have no clear relation, “r” is zero (g). The absolute values of “r” are higher when the individual data points are closer to a line showing the linear trend (a > b > c; d > e > f).

The absolute value of r represents the strength of association. A value of 1.0 implies a perfect linear relationship between the two variables, i.e., all observations lie on a straight line [Figure 1a and d], whereas 0 indicates the absence of any linear relationship [Figure 1g]. Higher values (closer to 1.0) imply that individual observations lie close to an imaginary line describing the relationship between the two variables [Figure 1b and e], and lower values imply that the observations are more spread out [Figure 1c and f].
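The sign and magnitude of r described above can be illustrated with a minimal sketch that computes Pearson's coefficient directly from its standard definition (the sum of products of deviations from the means, divided by the square root of the product of the sums of squared deviations). The function name `pearson_r` and the small data sets are hypothetical, chosen only for illustration; they do not come from the article.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient from its standard definition.

    Hypothetical helper for illustration; assumes x and y have equal
    length and that neither variable is constant.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [xi - mean_x for xi in x]  # deviations of x from its mean
    dy = [yi - mean_y for yi in y]  # deviations of y from its mean
    sxy = sum(a * b for a, b in zip(dx, dy))
    sxx = sum(a * a for a in dx)
    syy = sum(b * b for b in dy)
    return sxy / math.sqrt(sxx * syy)


# Made-up data, not taken from the article.
x = [1.0, 2.0, 3.0, 4.0, 5.0]

# All points on a rising straight line: perfect positive correlation.
r_perfect_pos = pearson_r(x, [2.0 * v + 1.0 for v in x])

# All points on a falling straight line: perfect negative correlation.
r_perfect_neg = pearson_r(x, [-3.0 * v + 10.0 for v in x])

# Rising trend with some scatter: positive, but with magnitude below 1.
r_scattered = pearson_r(x, [2.0, 1.0, 4.0, 3.0, 5.0])

print(round(r_perfect_pos, 3), round(r_perfect_neg, 3), round(r_scattered, 3))
```

In this sketch the two straight-line data sets give values of +1 and −1, while the scattered data set gives a smaller positive value, mirroring the ordering of the panels described for Figure 1.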
The square of the correlation coefficient (r²), known as the coefficient of determination, represents the proportion of variation in one variable that is accounted for by the variation in the other variable. For example, if the height and weight of a group of persons have a correlation coefficient of 0.80, one can estimate that 64% (0.80 × 0.80 = 0.64) of the variation in their weights is accounted for by the variation in their heights.

It is possible to calculate a P value for an observed correlation coefficient to determine whether a significant linear relationship exists between the two variables of interest. However, with medium- to large-sized samples, these methods show even small correlation coefficients to be highly significant, and hence their use is generally eschewed.
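The effect of sample size on the P value can be illustrated with a minimal sketch. It assumes the Fisher z-transformation with a normal approximation, which is one common way to test a correlation coefficient and is not necessarily the method the authors have in mind. The function names, the correlation of 0.10, and the sample sizes of 30 and 1000 are hypothetical values chosen for illustration.

```python
import math


def coefficient_of_determination(r):
    """Proportion of variation accounted for: the square of r."""
    return r * r


def approx_two_sided_p(r, n):
    """Approximate two-sided P value for the null hypothesis of no
    linear correlation, using the Fisher z-transformation.

    Assumption for this sketch: atanh(r) * sqrt(n - 3) is treated as a
    standard normal deviate. Requires n > 3 and |r| < 1.
    """
    z = math.atanh(r) * math.sqrt(n - 3)
    # Two-sided tail area of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2.0))


# The article's example: r = 0.80 corresponds to r squared = 0.64.
r2_example = coefficient_of_determination(0.80)

# The same weak correlation (hypothetical r = 0.10) in a small and a
# large sample; only the sample size changes.
weak_r = 0.10
r2_weak = coefficient_of_determination(weak_r)
p_small = approx_two_sided_p(weak_r, 30)
p_large = approx_two_sided_p(weak_r, 1000)

print("r squared for r = 0.80:", round(r2_example, 2))
print("r squared for r = 0.10:", round(r2_weak, 2))
print("approximate P, n = 30:  ", round(p_small, 4))
print("approximate P, n = 1000:", round(p_large, 4))
```

Under these assumptions, the weak correlation is far from significant in the small sample but highly significant in the large one, even though it accounts for only about 1% of the variation in both cases.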
Many of the above pitfalls are easily avoided if one first makes a scatter plot of the data and visually inspects it for nonlinear relationships, outliers, or the presence of obvious subgroups.

In addition, correlation analysis is often inappropriately used to measure agreement between two methods of measuring the same thing (e.g., tumor volume measured using ultrasound and computed tomography). This will be discussed in the next article in this series.

A relationship between two variables is sometimes taken as evidence that one causes the other. This is, however, often not true, and hence the popular statistical adage: “Correlation does not imply causation.” You may wish to visit https://en.wikipedia.org/wiki/Correlation_does_not_imply_causation for some interesting insights into how correlation can arise without any causative link. Examples of such noncausative correlation include (i) countries’ annual per capita chocolate consumption and the number of Nobel laureates per 10 million population;[1] and (ii) weekly ice-cream consumption and the number of drowning incidents in swimming pools. These arise because both variables in each pair are associated with a third factor: national income[2] and hot weather, respectively.

Correlation analysis is a very powerful tool to explore relationships in data. However, one must be careful to use it only when it is applicable. Many of these problems can be avoided by careful thought about the data, by plotting the raw data (to look for nonlinear relationships, outliers, and heteroscedasticity), and by thinking in terms of the coefficient of determination in preference to the correlation coefficient.

There are no conflicts of interest.

1. Messerli FH. Chocolate consumption, cognitive function, and Nobel laureates. N Engl J Med. 2012;367:1562–4.
2. Maurage P, Heeren A, Pesenti M. Does chocolate consumption really boost Nobel Award chances? The peril of over-interpreting correlations in health studies. J Nutr. 2013;143:931–3.

Articles from Perspectives in Clinical Research are provided here courtesy of Wolters Kluwer -- Medknow Publications