The ‘Effect Size’ is not a recognised mathematical technique

Three things you should know about the ‘Effect Size’

1.   Mathematicians don’t use it.

2.   Mathematics textbooks don’t teach it.

3.   Statistical packages don’t calculate it.

Despite a public challenge in March 2013, none of the advocates of the ‘Effect Size’ have been able to name a Mathematician, Mathematics textbook or Statistical package that uses it. They are welcome to correct this in the comments below.

27 thoughts on “The ‘Effect Size’ is not a recognised mathematical technique”

  1. Here are some reasons why I find effect sizes misleading, taken from pages 20 to 22 of:

    Wiliam, D. (2010). An integrative summary of the research literature and implications for a new theory of formative assessment. In H. L. Andrade & G. J. Cizek (Eds.), Handbook of formative assessment (pp. 18-40). New York, NY: Taylor & Francis.

    The use of standardized effect sizes to compare and synthesize studies is understandable, because few of the studies included in the various reviews published sufficient details to allow more sophisticated forms of synthesis to be undertaken, but relying on standardized effect sizes in educational studies creates substantial difficulties of interpretation, for two reasons.
    First, as Black and Wiliam (1998a) noted, effect size is influenced by the range of achievement in the population. An increase of 5 points on a test where the population standard deviation is 10 points would result in an effect size of 0.5 standard deviations. However, the same intervention when administered only to the upper half of the same population, provided that it was equally effective for all students, would result in an effect size of over 0.8 standard deviations, due to the reduced variance of the subsample. An often-observed finding in the literature—that formative assessment interventions are more successful for students with special educational needs (for example in Fuchs & Fuchs, 1986)—is difficult to interpret without some attempt to control for the restriction of range, and may simply be a statistical artifact.
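    [Editor’s note: a minimal simulation sketch, not part of Wiliam’s text, illustrating the restriction-of-range point above. The population SD of 10, the uniform 5-point gain, and the split at the median are illustrative assumptions.]

```python
# Illustrative sketch (not from Wiliam's chapter): the same 5-point gain
# yields a larger standardised effect size when only the upper half of the
# population is measured, because that subsample's SD is smaller.
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(loc=50, scale=10, size=1_000_000)  # population SD = 10
gain = 5.0  # assume the intervention adds 5 points for every student

d_full = gain / population.std()                             # 5 / 10 = 0.5
upper_half = population[population > np.median(population)]  # restricted range
d_upper = gain / upper_half.std()                            # SD of upper half ~ 6

print(f"whole population: d = {d_full:.2f}")   # ~0.50
print(f"upper half only:  d = {d_upper:.2f}")  # ~0.83, i.e. 'over 0.8'
```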

    The second and more important limitation of the meta-analytic reviews is that they fail to take into account the fact that different outcome measures are not equally sensitive to instruction (Popham, 2007). Much of the methodology of meta-analysis used in education and psychology has been borrowed uncritically from the medical and health sciences, where the different studies being combined in meta-analyses either use the same outcome measures (e.g., 1-year survival rates) or outcome measures that are reasonably consistent across different settings (e.g., time to discharge from hospital care). In education, to aggregate outcomes from different studies it is necessary to assume that the outcome measures are equally sensitive to instruction.

    It has long been known that teacher-constructed measures have tended to show greater effect sizes for experimental interventions than obtained with standardized tests, and this has sometimes been regarded as evidence of the invalidity of teacher-constructed measures. However, as has become clear in recent years, assessments vary greatly in their sensitivity to instruction—the extent to which they measure the things that educational processes change (Wiliam, 2007b). In particular, the way that standardized tests are constructed reduces their sensitivity to instruction. The reliability of a test can be increased by replacing items that do not discriminate between candidates with items that do, so items that all students answer correctly, or that all students answer incorrectly, are generally omitted. However, such systematic deletion of items can alter the construct being measured by the test, because items related to aspects of learning that are effectively taught by teachers are less likely to be included than items that are taught ineffectively.

    For example, an item that is answered incorrectly by all students in the seventh grade and answered correctly by all students in the eighth grade is almost certainly assessing something that is changed by instruction, but is unlikely to be retained in a test for seventh graders (because it is too hard), nor in one for eighth graders (because it is too easy). This is an extreme example, but it does highlight how the sensitivity of a test to the effects of instruction can be significantly affected by the normal processes of test development (Wiliam, 2008).

    The effects of sensitivity to instruction are far from negligible. Bloom (1984) famously observed that one-to-one tutorial instruction was more effective than average group-based instruction by two standard deviations. Such a claim is credible in the context of many assessments, but for standardized tests such as those used in the National Assessment of Educational Progress (NAEP), one year’s progress for an average student is equivalent to one-fourth of a standard deviation (NAEP, 2006), so for Bloom’s claim to be true, one year’s individual tuition would produce the same effect as 9 years of average group-based instruction, which seems unlikely. The important point here is that the outcome measures used in different studies are likely to differ significantly in their sensitivity to instruction, and the most significant element in determining an assessment’s sensitivity to instruction appears to be its distance from the curriculum it is intended to assess.
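    [Editor’s note: a back-of-envelope check, not part of Wiliam’s text, of the arithmetic behind the nine-year figure, using the stated assumptions of 0.25 SD per year on NAEP and Bloom’s two-sigma tutoring advantage.]

```python
# Hedged arithmetic check: if a year of average group instruction moves an
# average student 0.25 SD (NAEP, 2006) and tutoring adds a further 2 SD
# (Bloom, 1984), one tutored year equals (0.25 + 2) / 0.25 = 9 average years.
sd_per_year = 0.25
tutoring_advantage = 2.0
print((sd_per_year + tutoring_advantage) / sd_per_year)  # 9.0
```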
    Ruiz-Primo, Shavelson, Hamilton, and Klein (2002) proposed a five-fold classification for the distance of an assessment from the enactment of curriculum, with examples of each:
    1. Immediate, such as science journals, notebooks, and classroom tests;
    2. Close, or formal embedded assessments (for example, if an immediate assessment asked about number of pendulum swings in 15 seconds, a close assessment would ask about the time taken for 10 swings);
    3. Proximal, including a different assessment of the same concept, requiring some transfer (for example, if an immediate assessment asked students to construct boats out of paper cups, the proximal assessment would ask for an explanation of what makes bottles float or sink);
    4. Distal, for example a large-scale assessment from a state assessment framework, in which the assessment task was sampled from a different domain, such as physical science, and where the problem, procedures, materials and measurement methods differed from those used in the original activities; and
    5. Remote, such as standardized national achievement tests.

    As might be expected, Ruiz-Primo et al. (2002) found that the closer the assessment was to the enactment of the curriculum, the greater was the sensitivity of the assessment to the effects of instruction, and that the impact was considerable. For example, one of their interventions showed an average effect size of 0.26 when measured with a proximal assessment, but an effect size of 1.26 when measured with a close assessment.

    In none of the meta-analyses discussed above was there any attempt to control for the effects of differences in the sensitivity to instruction of the different outcome measures. By itself, it does not invalidate the claims that formative assessment is likely to be effective in improving student outcomes. Indeed, in all likelihood, attempts to improve the quality of teachers’ formative assessment practices are likely to be considerably more cost-effective than many, if not most, other interventions (Wiliam & Thomson, 2007). However, failure to control for the impact of this factor means that considerable care should be taken in quoting particular effect sizes as being likely to be achieved in practice, and other measures of the impact, such as increases in the rate of learning, may be more appropriate (Wiliam, 2007c). More importantly, attention may need to be shifted away from the size of the effects and toward the role that effective feedback can play in the design of effective learning environments (Wiliam, 2007a). In concluding their review of over 3,000 studies of the effects of feedback interventions in schools, colleges and workplaces, Kluger and DeNisi observed that:

    considerations of utility and alternative interventions suggest that even an FI [feedback intervention] with demonstrated positive effects should not be administered wherever possible. Rather, additional development of FIT [feedback intervention theory] is needed to establish the circumstance under which positive FI effects on performance are also lasting and efficient and when these effects are transient and have questionable utility. This research must focus on the processes induced by FIs and not on the general question of whether FIs improve performance—look how little progress 90 years of attempts to answer the latter question have yielded. (1996, p. 278)

    • How close is the OECD’s PISA to the enactment of the curriculum in the UK & Ireland, given the influence that governments and education ministers attach to something that doesn’t measure anything?

    • In three years doing a Maths degree and sixteen years teaching A level Maths, I have never heard the Pearson Correlation Coefficient referred to as an effect size. Nevertheless, let us assume that whenever I refer to the ‘Effect Size’ I’m talking about Cohen’s d.

    • There are tons of effect sizes calculated in statistical packages. They vary according to the test you are performing: Pearson’s r, R-squared, multiple correlation, and so on. The reason you don’t hear the term much is that they all have names. However, I don’t think you have done enough research. Every statistics book I own has numerous references to “effect size.”
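    [Editor’s note: a minimal sketch, with made-up scores, of the quantity the comments above call Cohen’s d: the difference between two group means divided by their pooled standard deviation.]

```python
# Minimal sketch (illustrative data): Cohen's d as the mean difference
# divided by the pooled standard deviation of the two groups.
import numpy as np

treatment = np.array([72.0, 75.0, 78.0, 80.0, 69.0, 74.0])  # hypothetical scores
control   = np.array([70.0, 68.0, 73.0, 71.0, 66.0, 72.0])  # hypothetical scores

n1, n2 = len(treatment), len(control)
s1, s2 = treatment.std(ddof=1), control.std(ddof=1)          # sample SDs
pooled_sd = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
d = (treatment.mean() - control.mean()) / pooled_sd
print(f"Cohen's d = {d:.2f}")
```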

  2. Pingback: Some things you ought to know about effect sizes | David Didau: The Learning Spy

  3. So you’re saying that research using effect sizes cannot be relied upon. Do you have an alternative, or could the calculation of effect size be qualified with some indicator (like error bars or something) so that an informed decision based on evidence could be made? Or do we need to go back to the drawing board and rely on intuition until we are given a clearer direction on this? How do we use the effect size research to help us make decisions if it can’t be trusted?
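    [Editor’s note: one common way to add the ‘error bars’ this comment asks about is a confidence interval around the effect size. The sketch below uses the usual large-sample approximation to the variance of Cohen’s d; the effect size and group sizes are hypothetical.]

```python
# Sketch: approximate 95% confidence interval for Cohen's d using the
# common large-sample variance approximation
#   var(d) ~ (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))
import math

d, n1, n2 = 0.40, 30, 30  # hypothetical effect size and group sizes
var_d = (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))
se = math.sqrt(var_d)
lower, upper = d - 1.96 * se, d + 1.96 * se
print(f"d = {d:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```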

  4. “3. Statistical packages don’t calculate it.”

    That’s not true. In R, both Cohen’s d & eta squared can be calculated with specific functions, or are already given in the output of an analysis.

    • R is a free, open-source programming language. Obviously a user has defined a Cohen’s d function and released it.

      My bigger question is, out of the dozens of commercial statistical packages, why don’t any of them calculate Cohen’s d? Doesn’t this indicate that Cohen’s d is completely unknown in the general Mathematical community?

      • This is nonsense. SPSS calculates effect sizes. See the “estimates of effect sizes” option in the ANOVA dialogue box.

      • SPSS calculates eta squared, not the Cohen’s d statistic.

        As a side-note, nobody calls these things Effect Sizes apart from Social Scientists, certainly not Mathematicians.
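    [Editor’s note: for context on the SPSS exchange above, eta squared is the proportion of total variance in a one-way ANOVA attributable to the between-groups factor. The sketch below uses made-up groups; it illustrates the formula and is not SPSS output.]

```python
# Sketch (made-up data): eta squared for a one-way ANOVA, computed as the
# between-groups sum of squares divided by the total sum of squares.
import numpy as np

groups = [
    np.array([4.0, 5.0, 6.0, 5.0]),   # hypothetical group A
    np.array([7.0, 8.0, 6.0, 7.0]),   # hypothetical group B
    np.array([9.0, 8.0, 10.0, 9.0]),  # hypothetical group C
]

all_scores = np.concatenate(groups)
grand_mean = all_scores.mean()

ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_total = ((all_scores - grand_mean) ** 2).sum()
eta_squared = ss_between / ss_total
print(f"eta squared = {eta_squared:.2f}")
```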

  5. Pingback: A Statistical Battleground | docendo discimus

  6. Pingback: Hattie’s book on visible learning | mrsadleir

  7. Pingback: Moving from marking towards feedback | Improving Teaching

  8. I have read all you have put up about Hattie with interest. The boss recently went to a presentation by Hattie and seems to be hailing him as the new messiah. While I have no problem with focussing on the things that are important, my difficulty is with context. In real classroom terms, how can a factor have the same ‘effect’ in every environment? According to this work, we are supposed to believe that class sizes of 40 give similar quality of learning to a class of 4. Sorry, that is just not true. Hattie’s bit about ‘now we are a profession, cos we use numbers’ is particularly galling. I will ask the boss (who has a numbers background) about negative probabilities and see what he says. Thanks for your work.
