Why Effect Size and Why Not Just Statistical Significance?
By: Monika Sharma & Sandeep Kavety
“Statistical significance is the least interesting thing about the results. You should describe the results in terms of measures of magnitude — not just, does treatment affect people, but how much does it affect them.”
Gene V. Glass, quoted in Kline (2004, p. 95)
Research reports and articles frequently describe results as “significant,” which can be ambiguous: it is not always clear whether the authors mean statistical significance or practical significance. The distinction matters. Statistical significance indicates only whether an observed difference between groups or samples is larger than would plausibly arise from chance alone. That information is useful, but it says nothing about the magnitude or real-world importance of the difference, which is equally essential.
Practical significance, as defined by Ellis (2010), refers to the size of an effect or the real-world importance of a statistical finding. Statistical significance and practical significance do not necessarily go together: a result can be statistically significant yet have no practical significance, and vice versa. In educational and impact research, the important question for stakeholders is how meaningful the results are.
Over the past two decades, research has pointed to standardized effect size measures as an effective way to address the question of practical significance. An effect size quantifies how large a result is, providing a statistical measure of practical significance. Effect sizes are generally easier to understand than significance tests and lend themselves to more meaningful interpretation. They emphasize the magnitude of the difference between two groups without confounding it with sample size, which makes it possible for researchers to compare the effects of different interventions or treatments. In this way, effect size helps to answer the research question that motivated the study.
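To illustrate the point about sample size, here is a minimal sketch in Python. It assumes two independent groups of equal size, in which case the t statistic equals d multiplied by the square root of half the per-group sample size. The function name and the example sample sizes are hypothetical and chosen only for illustration.

```python
import math

def t_statistic_from_d(d, n_per_group):
    """t statistic for two independent, equal-sized groups, given Cohen's d.

    Follows from t = (mean difference) / (pooled SD * sqrt(2 / n)).
    """
    return d * math.sqrt(n_per_group / 2)

# The same small effect (d = 0.10) at three illustrative sample sizes.
for n in (20, 200, 2000):
    t = t_statistic_from_d(0.10, n)
    print(f"n per group = {n:4d}   d = 0.10   t = {t:.2f}")
```

The effect size stays at 0.10 throughout, while the t statistic grows with the sample size and eventually passes the conventional two-sided 5% threshold of roughly 1.96. A tiny effect can therefore be statistically significant in a large enough sample.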
Understanding Effect Size
“Effect size is a quantitative measure to estimate the magnitude of any intervention effect.”
“an effect size (ES) is the amount of anything that’s of research interest” (Cumming and Calin-Jageman, 2017, p. 111).
Effect sizes can be reported either as a simple (unstandardized) effect size or as a standardized measure. The most common simple effect size is the mean difference between two groups, expressed in the original units of the outcome. Simple effect sizes should be used with caution when the unit of measurement lacks inherent meaning; in such cases, standardized measures of effect size are recommended.
A standardized effect size is one that has been adjusted for the variability observed in the sample or the population it comes from. Describing the size of an observed effect with a standardized measure lets researchers quantify their findings and discuss their application. Effect sizes can be calculated for differences or for relationships between variables. An effect can be the result of a treatment revealed in a comparison between groups (e.g., treated and untreated groups), or it can describe the degree of association between two related variables (e.g., student motivation and academic achievement). If the research is comparing groups for differences, we would not expect to see much overlap in the data. If we are evaluating the data for similarity or association, we are looking for more overlap. Thus most effect sizes can be grouped into one of two families: differences between groups (the d family) and measures of association (the r family). The d family effect sizes are calculated by dividing the difference between group means by the standard deviation of the observations. We focus here mainly on effect sizes calculated for differences between groups, i.e., the d family.
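As a rough sketch of the d family calculation described above, the following Python function divides the difference between two group means by the pooled standard deviation. Using the pooled standard deviation as the denominator is one common choice and an assumption here. The function name and the scores are made up for illustration.

```python
import math
import statistics

def cohens_d(group_a, group_b):
    """Cohen's d for two independent groups, using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd

# Hypothetical test scores for a treated and an untreated group.
treated = [72, 75, 78, 80, 85]
untreated = [68, 70, 74, 76, 82]
print(round(cohens_d(treated, untreated), 2))
```

The simple effect size here is the raw mean difference of 4 points. Dividing it by the pooled standard deviation expresses the same difference in standard deviation units.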
Understanding Effect Size Calculation in Educational Research (d family)
In impact evaluation in education, most research tries to answer how much difference an intervention makes to outcomes. Standardized effect size measures quantify this impact by indicating its magnitude; statistical significance is assessed separately through a significance test. Most of these impact evaluation studies follow a quasi-experimental, cross-sectional, or longitudinal research design.
Practical Significance in Quasi-Experimental Studies
In quasi-experimental studies, where randomization can be challenging, the role of effect size becomes instrumental in gauging the practical significance of observed differences. It aids in ensuring result comparability, informing evidence-based decision-making, and contributing to a nuanced understanding of the intervention’s impact.
Cross-sectional and Longitudinal Studies (System-Level Impacts)
Similarly, in cross-sectional and longitudinal studies, employing effect size is vital. It quantifies the practical significance of changes, provides a clear understanding of system-level impacts, and facilitates comparisons across different study periods.
Exploring Types of Comparisons in Educational Research
Educational research draws on a range of comparisons. We explore four common types:
- Comparison of the same group’s scores: Also known as a within-group design, repeated-measures design, or panel study. When there are only two measurement occasions, the common statistical analysis is a paired t-test, which indicates whether the observed difference is statistically significant. To check for practical significance, we calculate the effect size using Cohen’s d for paired samples. If there are more than two measurement occasions, a repeated-measures ANOVA applies, with an effect size suited to that analysis, such as partial eta squared (Lakens, 2013).
- Comparison of different groups: Known as a between-group design, this allows for comparing the differences between two or more groups. A two-group comparison, such as between girls and boys, would use an independent-samples t-test. Studies with more than two groups would apply an analysis of variance (ANOVA). Cohen’s d for independent samples is the effect size for the t-test, whereas for ANOVA it is eta squared that needs to be calculated (Lakens, 2013).
- Comparison of groups over different time periods: Also known as a longitudinal design. We often come across two types of longitudinal design: the panel study, in which the same set of participants is followed over time, and the cohort study, in which a different set of students is sampled at each time point. A cohort study focuses on observing changes within a specific population or cohort over multiple time points, even though the individuals being studied may vary. For a panel study, the same measure of d applies as for the repeated-measures design discussed in point 1, whereas for a cohort study Cohen’s d is calculated as discussed in point 2.
- Difference-in-differences design: This method allows the effect to be attributed to the intervention. In such a design there is a pretest and a posttest for both the treatment group and the comparison/control group. These designs also vary in whether they use panels or cohorts within the treatment and comparison groups. The pooled pretest standard deviation is recommended for calculating d, so that the intervention does not affect the standard deviation (Morris, 2008).
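The designs listed above can be sketched in Python as follows. These are illustrative versions under stated assumptions. The paired-samples d shown is the d_z variant (mean difference divided by the standard deviation of the differences), which is one of several paired-sample formulas. Eta squared is the between-group sum of squares divided by the total sum of squares. The difference-in-differences d divides the difference in mean gains by the pooled pretest standard deviation and leaves out any small-sample bias correction. All function names and data are hypothetical.

```python
import math
import statistics

def cohens_dz(pre, post):
    """Paired-samples d (d_z): mean of the differences over their standard deviation."""
    diffs = [after - before for before, after in zip(pre, post)]
    return statistics.mean(diffs) / statistics.stdev(diffs)

def eta_squared(*groups):
    """Eta squared for a one-way between-group ANOVA: SS_between / SS_total."""
    all_scores = [x for g in groups for x in g]
    grand_mean = statistics.mean(all_scores)
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    ss_total = sum((x - grand_mean) ** 2 for x in all_scores)
    return ss_between / ss_total

def did_effect_size(pre_t, post_t, pre_c, post_c):
    """Difference in mean gains divided by the pooled pretest standard deviation."""
    n_t, n_c = len(pre_t), len(pre_c)
    pooled_pre_sd = math.sqrt(
        ((n_t - 1) * statistics.variance(pre_t) + (n_c - 1) * statistics.variance(pre_c))
        / (n_t + n_c - 2)
    )
    gain_t = statistics.mean(post_t) - statistics.mean(pre_t)
    gain_c = statistics.mean(post_c) - statistics.mean(pre_c)
    return (gain_t - gain_c) / pooled_pre_sd

# Hypothetical scores, for illustration only.
print(cohens_dz([10, 12, 14, 16], [12, 13, 17, 18]))
print(eta_squared([1, 2, 3], [4, 5, 6], [7, 8, 9]))
print(did_effect_size([10, 12, 14, 16], [14, 16, 18, 20],
                      [10, 12, 14, 16], [11, 13, 15, 17]))
```

For a panel design the first function applies to each pair of time points; for a cohort design, the independent-samples d based on the pooled standard deviation would be used instead.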
When applying any standardized measure of effect size, one must make sure to apply the corrections needed for differences in group sample sizes or for unequal variability between groups.
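Two widely used adjustments of this kind are sketched below. Whether these are the specific corrections intended here is an assumption. Hedges’ g multiplies d by an approximate small-sample correction factor, and Glass’s delta standardizes by the control group’s standard deviation alone when the two groups vary to different degrees. The function names and example values are hypothetical.

```python
import statistics

def hedges_g(d, n_a, n_b):
    """Hedges' g: Cohen's d times the approximate small-sample correction factor."""
    df = n_a + n_b - 2
    correction = 1 - 3 / (4 * df - 1)
    return d * correction

def glass_delta(treatment, control):
    """Glass's delta: mean difference divided by the control group's standard deviation."""
    return (statistics.mean(treatment) - statistics.mean(control)) / statistics.stdev(control)

# With 10 participants per group, d = 0.50 shrinks slightly after correction.
print(round(hedges_g(0.50, 10, 10), 3))
# Made-up scores: standardizing by the control group's spread only.
print(glass_delta([4, 6, 8], [1, 3, 5]))
```

The correction factor approaches 1 as the sample grows, so g and d are nearly identical in large studies.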
Interpreting Effect Size
Effect size, as a quantitative measure of the magnitude of an intervention’s impact, plays a pivotal role in research interpretation. The standard interpretation assumes that values in the ‘control’ and ‘experimental’ groups follow a Normal distribution and share the same standard deviation. Under these assumptions, an effect size can be understood in terms of how much the two distributions overlap, expressed as percentiles or ranks. This helps us gauge how likely it is that we could tell which group a given value comes from, and lets us compare the effect with known effects and outcomes.
Standardized effect sizes are comparable to z-scores and offer a direct way to articulate differences between samples. For instance, an effect size of 0.30 implies that the average score in the experimental group exceeds the scores of about 62% of the control group. When interpreting effect size, one can apply standardized cutoffs, such as Cohen’s (1969) classification of 0.2 as small, 0.5 as medium, and 0.8 as large. However, Glass et al. (1981) caution against terms like ‘small,’ ‘medium,’ and ‘large,’ arguing that interpretation should consider contextual relevance and relative costs. In education, even a small effect, if cost-effective and universally beneficial over time, could represent a significant improvement in academic achievement. Selecting a benchmark for interpretation therefore requires a thoughtful evaluation of the research context.
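The percentile reading of an effect size can be sketched in a few lines of Python. Under the assumptions of Normal distributions with equal standard deviations, the share of the control group scoring below the average experimental-group member is the standard Normal cumulative probability at d. The function names are hypothetical, and the labels simply apply Cohen’s cutoffs without the contextual judgment discussed above.

```python
import math

def normal_cdf(z):
    """Cumulative probability of the standard Normal distribution at z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def percent_of_control_below(d):
    """Percent of the control group scoring below the average experimental score."""
    return 100 * normal_cdf(d)

def cohen_label(d):
    """Conventional label for d using Cohen's 0.2 / 0.5 / 0.8 cutoffs."""
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "below Cohen's 'small' cutoff"

for d in (0.2, 0.3, 0.5, 0.8):
    print(f"d = {d:.1f}: {percent_of_control_below(d):.1f}% of control below, {cohen_label(d)}")
```

An effect size of 0 corresponds to 50%, meaning the two distributions coincide, and d = 0.30 corresponds to roughly 62%.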
Conclusion
In summary, integrating effect size into research goes beyond the conventional focus on statistical significance. Prominent research organizations such as the American Educational Research Association (AERA) and the American Psychological Association (APA) have advocated reporting and interpreting effect sizes since the early 2000s. This emphasis does not favor effect size over statistical significance; instead, it underscores a comprehensive reporting approach that includes both. Relying solely on statistical significance is discouraged, because it indicates only how likely an observed sample statistic would be to arise from random sampling variability. Reporting both gives a more complete and insightful picture of research findings.
References:
Gene V. Glass in Kline R. B. (2004). Beyond Significance Testing: Reforming Data Analysis Methods in Behavioral Research. Washington DC: American Psychological Association. p. 95.
Schäfer T, Schwarz MA. The Meaningfulness of Effect Sizes in Psychological Research: Differences Between Sub-Disciplines and the Impact of Potential Biases. Front Psychol. 2019 Apr 11;10:813. doi: 10.3389/fpsyg.2019.00813. PMID: 31031679; PMCID: PMC6470248.
https://www.simplypsychology.org/effect-size.html
Rosenthal, R. (1994). Parametric measures of effect size. In H. Cooper & L. V. Hedges (Eds.), The Handbook of Research Synthesis (pp. 231–244). New York: Russell Sage Foundation.
Lakens, D. (2013). Calculating and reporting effect sizes to facilitate cumulative science: a practical primer for t-tests and ANOVAs. Frontiers in Psychology, 4, 863. https://doi.org/10.3389/fpsyg.2013.00863
Morris, S. B. (2008). Estimating Effect Sizes From Pretest-Posttest-Control Group Designs. Organizational Research Methods, 11(2), 364–386. http://doi.org/10.1177/1094428106291059
Lenhard, W. & Lenhard, A. (2022). Computation of effect sizes. Retrieved from: https://www.psychometrica.de/effect_size.html. Psychometrica. DOI: 10.13140/RG.2.2.17823.92329
Bloom, H. S., Hill, C. J., Black, A. R., & Lipsey, M. W. (2008). Performance trajectories and performance gaps as achievement effect-size benchmarks for educational interventions. Journal of Research on Educational Effectiveness, 1(4), 289–328. https://doi.org/10.1080/19345740802400072
Scammacca, N. K., Fall, A.-M., & Roberts, G. (2015). Benchmarks for expected annual academic growth for students in the bottom quartile of the normative distribution. Journal of Research on Educational Effectiveness, 8(3), 366–379. https://doi.org/10.1080/19345747.2014.952464