The ‘Effect Size’ is not a recognised mathematical technique

Three things you should know about the ‘Effect Size’

1.   Mathematicians don’t use it.

2.   Mathematics textbooks don’t teach it.

3.   Statistical packages don’t calculate it.

Despite a public challenge in March 2013, none of the advocates of the ‘Effect Size’ have been able to name a Mathematician, Mathematics textbook or Statistical package that uses it. They are welcome to correct this in the comments below.

27 thoughts on “The ‘Effect Size’ is not a recognised mathematical technique”

  1. Here are some reasons why I find effect sizes misleading, taken from pages 20 to 22 of:

    Wiliam, D. (2010). An integrative summary of the research literature and implications for a new theory of formative assessment. In H. L. Andrade & G. J. Cizek (Eds.), Handbook of formative assessment (pp. 18-40). New York, NY: Taylor & Francis.

    The use of standardized effect sizes to compare and synthesize studies is understandable, because few of the studies included in the various reviews published sufficient details to allow more sophisticated forms of synthesis to be undertaken, but relying on standardized effect sizes in educational studies creates substantial difficulties of interpretation, for two reasons.
    First, as Black and Wiliam (1998a) noted, effect size is influenced by the range of achievement in the population. An increase of 5 points on a test where the population standard deviation is 10 points would result in an effect size of 0.5 standard deviations. However, the same intervention when administered only to the upper half of the same population, provided that it was equally effective for all students, would result in an effect size of over 0.8 standard deviations, due to the reduced variance of the subsample. An often-observed finding in the literature—that formative assessment interventions are more successful for students with special educational needs (for example in Fuchs & Fuchs, 1986)—is difficult to interpret without some attempt to control for the restriction of range, and may simply be a statistical artifact.
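    [Editor’s note: a minimal simulation sketch, not part of Wiliam’s text, illustrating the restriction-of-range point above. The population SD of 10, the uniform 5-point gain, and the split at the median are illustrative assumptions.]

```python
# Illustrative sketch (not from Wiliam's chapter): the same 5-point gain
# yields a larger standardised effect size when only the upper half of the
# population is measured, because that subsample's SD is smaller.
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(loc=50, scale=10, size=1_000_000)  # population SD = 10
gain = 5.0  # assume the intervention adds 5 points for every student

d_full = gain / population.std()                             # 5 / 10 = 0.5
upper_half = population[population > np.median(population)]  # restricted range
d_upper = gain / upper_half.std()                            # SD of upper half ~ 6

print(f"whole population: d = {d_full:.2f}")   # ~0.50
print(f"upper half only:  d = {d_upper:.2f}")  # ~0.83, i.e. 'over 0.8'
```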

    The second and more important limitation of the meta-analytic reviews is that they fail to take into account the fact that different outcome measures are not equally sensitive to instruction (Popham, 2007). Much of the methodology of meta-analysis used in education and psychology has been borrowed uncritically from the medical and health sciences, where the different studies being combined in meta-analyses either use the same outcome measures (e.g., 1-year survival rates) or outcome measures that are reasonably consistent across different settings (e.g., time to discharge from hospital care). In education, to aggregate outcomes from different studies it is necessary to assume that the outcome measures are equally sensitive to instruction.

    It has long been known that teacher-constructed measures have tended to show greater effect sizes for experimental interventions than obtained with standardized tests, and this has sometimes been regarded as evidence of the invalidity of teacher-constructed measures. However, as has become clear in recent years, assessments vary greatly in their sensitivity to instruction—the extent to which they measure the things that educational processes change (Wiliam, 2007b). In particular, the way that standardized tests are constructed reduces their sensitivity to instruction. The reliability of a test can be increased by replacing items that do not discriminate between candidates with items that do, so items that all students answer correctly, or that all students answer incorrectly, are generally omitted. However, such systematic deletion of items can alter the construct being measured by the test, because items related to aspects of learning that are effectively taught by teachers are less likely to be included than items that are taught ineffectively.

    For example, an item that is answered incorrectly by all students in the seventh grade and answered correctly by all students in the eighth grade is almost certainly assessing something that is changed by instruction, but is unlikely to be retained in a test for seventh graders (because it is too hard), nor in one for eighth graders (because it is too easy). This is an extreme example, but it does highlight how the sensitivity of a test to the effects of instruction can be significantly affected by the normal processes of test development (Wiliam, 2008).

    The effects of sensitivity to instruction are far from negligible. Bloom (1984) famously observed that one-to-one tutorial instruction was more effective than average group-based instruction by two standard deviations. Such a claim is credible in the context of many assessments, but for standardized tests such as those used in the National Assessment of Educational Progress (NAEP), one year’s progress for an average student is equivalent to one-fourth of a standard deviation (NAEP, 2006), so for Bloom’s claim to be true, one year’s individual tuition would produce the same effect as 9 years of average group-based instruction, which seems unlikely. The important point here is that the outcome measures used in different studies are likely to differ significantly in their sensitivity to instruction, and the most significant element in determining an assessment’s sensitivity to instruction appears to be its distance from the curriculum it is intended to assess.
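    [Editor’s note: a back-of-envelope check, not part of Wiliam’s text, of the arithmetic behind the nine-year figure, using the stated assumptions of 0.25 SD per year on NAEP and Bloom’s two-sigma tutoring advantage.]

```python
# Hedged arithmetic check: if a year of average group instruction moves an
# average student 0.25 SD (NAEP, 2006) and tutoring adds a further 2 SD
# (Bloom, 1984), one tutored year equals (0.25 + 2) / 0.25 = 9 average years.
sd_per_year = 0.25
tutoring_advantage = 2.0
print((sd_per_year + tutoring_advantage) / sd_per_year)  # 9.0
```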
    Ruiz-Primo, Shavelson, Hamilton, and Klein (2002) proposed a five-fold classification for the distance of an assessment from the enactment of curriculum, with examples of each:
    1. Immediate, such as science journals, notebooks, and classroom tests;
    2. Close, or formal embedded assessments (for example, if an immediate assessment asked about number of pendulum swings in 15 seconds, a close assessment would ask about the time taken for 10 swings);
    3. Proximal, including a different assessment of the same concept, requiring some transfer (for example, if an immediate assessment asked students to construct boats out of paper cups, the proximal assessment would ask for an explanation of what makes bottles float or sink);
    4. Distal, for example a large-scale assessment from a state assessment framework, in which the assessment task was sampled from a different domain, such as physical science, and where the problem, procedures, materials and measurement methods differed from those used in the original activities; and
    5. Remote, such as standardized national achievement tests.

    As might be expected, Ruiz-Primo et al. (2002) found that the closer the assessment was to the enactment of the curriculum, the greater was the sensitivity of the assessment to the effects of instruction, and that the impact was considerable. For example, one of their interventions showed an average effect size of 0.26 when measured with a proximal assessment, but an effect size of 1.26 when measured with a close assessment.

    In none of the meta-analyses discussed above was there any attempt to control for the effects of differences in the sensitivity to instruction of the different outcome measures. By itself, it does not invalidate the claims that formative assessment is likely to be effective in improving student outcomes. Indeed, in all likelihood, attempts to improve the quality of teachers’ formative assessment practices are likely to be considerably more cost-effective than many, if not most, other interventions (Wiliam & Thomson, 2007). However, failure to control for the impact of this factor means that considerable care should be taken in quoting particular effect sizes as being likely to be achieved in practice, and other measures of the impact, such as increases in the rate of learning, may be more appropriate (Wiliam, 2007c). More importantly, attention may need to be shifted away from the size of the effects and toward the role that effective feedback can play in the design of effective learning environments (Wiliam, 2007a). In concluding their review of over 3,000 studies of the effects of feedback interventions in schools, colleges and workplaces, Kluger and DeNisi observed that:

    considerations of utility and alternative interventions suggest that even an FI [feedback intervention] with demonstrated positive effects should not be administered wherever possible. Rather, additional development of FIT [feedback intervention theory] is needed to establish the circumstance under which positive FI effects on performance are also lasting and efficient and when these effects are transient and have questionable utility. This research must focus on the processes induced by FIs and not on the general question of whether FIs improve performance—look how little progress 90 years of attempts to answer the latter question have yielded. (1996, p. 278)

    • How close is the OECD’s PISA to the enactment of the curriculum in the UK & Ireland, given the influence that governments and education ministers attach to something that doesn’t measure anything?

    • In three years doing a Maths degree and sixteen years teaching A level Maths, I have never heard the Pearson Correlation Coefficient referred to as an effect size. Nevertheless, let us assume that whenever I refer to the ‘Effect Size’ I’m talking about Cohen’s d.

    • There are tons of effect sizes calculated in statistical packages. They vary according to the test you are performing: Pearson’s r, R-squared, multiple correlation, and so on. The reason you don’t hear the term much is that they all have names. However, I don’t think you have done enough research. Every statistics book I own has numerous references to “effect size.”
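    [Editor’s note: a minimal sketch, with made-up scores, of the quantity the comments above call Cohen’s d: the difference between two group means divided by their pooled standard deviation.]

```python
# Minimal sketch (illustrative data): Cohen's d as the mean difference
# divided by the pooled standard deviation of the two groups.
import numpy as np

treatment = np.array([72.0, 75.0, 78.0, 80.0, 69.0, 74.0])  # hypothetical scores
control   = np.array([70.0, 68.0, 73.0, 71.0, 66.0, 72.0])  # hypothetical scores

n1, n2 = len(treatment), len(control)
s1, s2 = treatment.std(ddof=1), control.std(ddof=1)          # sample SDs
pooled_sd = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
d = (treatment.mean() - control.mean()) / pooled_sd
print(f"Cohen's d = {d:.2f}")
```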

  2. Pingback: Some things you ought to know about effect sizes | David Didau: The Learning Spy

  3. So you’re saying that research using effect sizes cannot be relied upon. Do you have an alternative, or could the calculation of effect size be qualified with some indicator (like error bars or something) so that an informed decision based on evidence could be made? Or do we need to go back to the drawing board and rely on intuition until we are given a clearer direction on this? How do we use the effect size research to help us make decisions if it can’t be trusted?
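    [Editor’s note: one common way to add the ‘error bars’ this comment asks about is a confidence interval around the effect size. The sketch below uses the usual large-sample approximation to the variance of Cohen’s d; the effect size and group sizes are hypothetical.]

```python
# Sketch: approximate 95% confidence interval for Cohen's d using the
# common large-sample variance approximation
#   var(d) ~ (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))
import math

d, n1, n2 = 0.40, 30, 30  # hypothetical effect size and group sizes
var_d = (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))
se = math.sqrt(var_d)
lower, upper = d - 1.96 * se, d + 1.96 * se
print(f"d = {d:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```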

  4. “3. Statistical packages don’t calculate it.”

    That’s not true. In R, both Cohen’s d & eta squared can be calculated with specific functions, or are already given in the output of an analysis.

    • R is a free, open-source programming language. Obviously a user has defined a Cohen’s d function and released it.

      My bigger question is, out of the dozens of commercial statistical packages, why don’t any of them calculate Cohen’s d? Doesn’t this indicate that Cohen’s d is completely unknown in the general Mathematical community?

      • This is nonsense. SPSS calculates effect sizes. See the “estimates of effect sizes” option in the ANOVA dialogue box.

      • SPSS calculates eta squared, not the Cohen’s d statistic.

        As a side-note, nobody calls these things Effect Sizes apart from Social Scientists, certainly not Mathematicians.
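    [Editor’s note: for context on the SPSS exchange above, eta squared is the proportion of total variance in a one-way ANOVA attributable to the between-groups factor. The sketch below uses made-up groups; it illustrates the formula and is not SPSS output.]

```python
# Sketch (made-up data): eta squared for a one-way ANOVA, computed as the
# between-groups sum of squares divided by the total sum of squares.
import numpy as np

groups = [
    np.array([4.0, 5.0, 6.0, 5.0]),   # hypothetical group A
    np.array([7.0, 8.0, 6.0, 7.0]),   # hypothetical group B
    np.array([9.0, 8.0, 10.0, 9.0]),  # hypothetical group C
]

all_scores = np.concatenate(groups)
grand_mean = all_scores.mean()

ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_total = ((all_scores - grand_mean) ** 2).sum()
eta_squared = ss_between / ss_total
print(f"eta squared = {eta_squared:.2f}")
```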

  5. Pingback: A Statistical Battleground | docendo discimus

  6. Pingback: Hattie’s book on visible learning | mrsadleir

  7. Pingback: Moving from marking towards feedback | Improving Teaching

  8. I have read all you have put up about Hattie with interest. The boss recently went to a presentation by Hattie and seems to be hailing him as the new messiah. While I have no problem with focussing on the things that are important, my difficulty is with context. In real classroom terms, how can a factor have the same ‘effect’ in every environment? According to this work, we are supposed to believe that class sizes of 40 give similar quality of learning to a class of 4. Sorry, that is just not true. Hattie’s bit about ‘now we are a profession, cos we use numbers’ is particularly galling. I will ask the boss (who has a numbers background) about negative probabilities and see what he says. Thanks for your work.
