The two kinds of Effect Size

At the start of ‘Visible Learning’, John Hattie talks about the two different ways to calculate the Effect Size.

Effect Size = (Mean of group at end of intervention – Mean of group at start of intervention) / Standard Deviation


Effect Size = (Mean of intervention group – Mean of control group) / Standard Deviation

Now, obviously, real Mathematical things don’t have two different definitions, because of the confusion this causes, as we shall see.

The problem is that the first definition measures actual improvement whereas the second measures relative improvement.
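As a rough sketch of how far apart the two calculations can be, here they are in Python. The scores are made up purely for illustration, and since the formulas above don’t say which standard deviation to use, the pooled standard deviation of the two groups is assumed here. The function names are my own.

```python
from statistics import mean, stdev

def pooled_sd(a, b):
    """Pooled standard deviation of two samples (an assumed choice of SD)."""
    na, nb = len(a), len(b)
    var_a, var_b = stdev(a) ** 2, stdev(b) ** 2
    return (((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2)) ** 0.5

def effect_size(baseline, comparison):
    """(mean of comparison - mean of baseline) / pooled SD."""
    return (mean(comparison) - mean(baseline)) / pooled_sd(baseline, comparison)

# Hypothetical test scores, not taken from any real study.
intervention_start = [40, 45, 50, 55, 60]
intervention_end = [48, 53, 58, 63, 68]
control_end = [45, 50, 55, 60, 65]

# First definition: end of intervention vs start of intervention.
actual = effect_size(intervention_start, intervention_end)
# Second definition: intervention group vs control group.
relative = effect_size(control_end, intervention_end)

print(round(actual, 2))    # 1.01
print(round(relative, 2))  # 0.38
```

With these invented numbers the same intervention scores about 1.01 under the first definition and about 0.38 under the second, because the control group improved too.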

To give an example, imagine Ann and Bob are both travelling to an education conference. They set off at the same time, driving their cars down the motorway.

We know that Ann drives at an average speed of 40 mph. How can we tell if Bob will get there first?

We could give his actual speed of 50 mph.

Or we could give his relative speed: he’s travelling 10 mph faster than Ann.

It doesn’t matter which one we use as long as we all know which definition we’re talking about.

The problem comes when we start using the same words to mean two different things and start mixing them up.

When comparing Bob’s speed to Ann’s, a good actual speed would be anything over 40 mph, but a good relative speed would be anything over 0 mph.

Now, if I say that Cathy is going at a speed of 30 mph, is that an actual speed, in which case she’s going the slowest, or a relative speed, in which case she’s travelling the fastest?
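The arithmetic behind Cathy’s ambiguous 30 mph can be spelled out in a few lines of Python. This is only a sketch of the analogy; the helper name `as_actual` is hypothetical.

```python
ANN_MPH = 40  # Ann's average speed, from the example above

def as_actual(speed_mph, is_relative):
    """Convert a quoted speed to an actual speed in mph.

    A relative speed is measured against Ann, so Ann's speed is added back.
    """
    return ANN_MPH + speed_mph if is_relative else speed_mph

# Bob: both descriptions give the same actual speed.
bob_actual = as_actual(50, is_relative=False)         # 50
bob_from_relative = as_actual(10, is_relative=True)   # 50

# Cathy: the same "30 mph" means two very different things.
cathy_if_actual = as_actual(30, is_relative=False)    # 30, the slowest
cathy_if_relative = as_actual(30, is_relative=True)   # 70, the fastest
```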

Hattie only mentions the two different types of Effect Size once, at the start of the book, but the way he talks later on (“Everything works” and “We need to compare with 0.40”) shows that the definition he is using is the first one, measuring actual improvement. However, did he make that distinction when he was choosing his studies? I suspect that he has not realised that the two different ways of measuring produce very different answers, and that he has just thrown both types of study in together.

For John Hattie, any Effect Size bigger than 0.40 would be ‘good’.

Now which version does the Education Endowment Foundation use?

In their recent Chatterbooks study they say





This and the fact that they use control groups show that they are using the second way of calculating the Effect Size, the relative way.

So, for the Education Endowment Foundation, anything better than zero would be ‘good’.

So, to sum up, the two major players using the ‘Effect Size’, John Hattie and the Education Endowment Foundation, are actually using it to mean two completely different calculations, one actual and one relative. For one of them, anything above 0.40 would be ‘good’; for the other, anything above zero would be ‘good’.



14 thoughts on “The two kinds of Effect Size”

  1. It sounds like you are saying that the EEF are making the same mistake as Hattie, i.e., throwing in studies that use both methods of calculating the ‘Effect Size’. If so, are you saying that using either source as a means of focusing our pedagogy is a waste of time?


    • No, I don’t think the EEF have used both ways of calculating the ‘Effect Size’. I think they use just one, but they are using Statistics that no Mathematician has ever heard of.

  2. The Toolkit focuses on standardised mean difference as a measure of the impact of different interventions.
    The EEF quantifies the effectiveness of a particular intervention by calculating the effect size as the standardised mean difference between two groups – for us, the intervention and the control group. The ES places the emphasis on the most important aspect of the intervention – the size of the effect – rather than its statistical significance, which conflates the effect size and sample size.
    In this respect, we are comparing the relative impact of the intervention as we want to identify the most effective ways as compared to ‘business as usual’.
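The remark that statistical significance conflates effect size and sample size can be illustrated with a small sketch. It assumes two equal-sized groups and a pooled standard deviation, in which case the two-sample t statistic is the standardised mean difference multiplied by the square root of half the group size. The value d = 0.2 and the group sizes are made up.

```python
from math import sqrt

def t_statistic(d, n_per_group):
    """Two-sample t for equal group sizes and pooled SD: t = d * sqrt(n / 2)."""
    return d * sqrt(n_per_group / 2)

d = 0.2  # hypothetical standardised mean difference, held fixed
for n in (20, 200, 2000):
    print(n, round(t_statistic(d, n), 2))
# 20 0.63
# 200 2.0
# 2000 6.32
```

The effect size stays at 0.2 throughout, while the t statistic (and so the significance) grows with the number of pupils in each group.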
    Rather than paraphrase everything in the Technical Appendices to the Toolkit, it may make more sense for those interested in understanding the different methods of calculating effect sizes to look at this:

    • Thanks James. This reply is the perfect example of two things. Firstly, how everyone involved with the ‘Effect Size’ just parrots the same old nonsense. You can usually see the same sentences repeated word for word in every piece written by ‘Effect Size’ users. No fresh thought, the blind leading the blind. And this brings me to my second point, the Mathematical quality of the blind men. James Richardson did a degree in Politics, PGCE in Geography and MA in Education, Culture and Society. He might not have done any Maths since GCSE, yet he’s a Senior Analyst at the EEF. Typically, most Education Researchers will have no background in Maths or Science, which is why they just all copy each other’s mistakes and never think to check their work with a Mathematician.

  3. That’s correct – the sentences are taken from the Technical Appendices of the Toolkit authored by Profs Coe and Higgins at Durham University. Thankfully we have experts that I can draw on so that we do not have to rely on non-statisticians like me to explain the technicalities. I was simply responding to a request from a teacher to comment on your blog so that they can understand the implications of the Toolkit for their school policies and practices. And this is what the Toolkit is trying to do: provide an accessible summary of research that can inform the educational decisions of school leaders. If there is an alternative to Effect Sizes that will allow us to compare the impact of educational interventions then please do explain the methodology and its benefits.

    • Yes James. Use the Statistics that Mathematicians have developed over decades instead of a load of mumbo-jumbo conjured up by Psychologists that Mathematicians have never heard of.

  4. That isn’t really an answer to the question, Ollie. You are very keen to criticise effect sizes, but offer no actual comparative apart from “the statistics”. Effect sizes make use of the statistics that mathematicians have developed over decades and, although not perfect (no mathematical model for capturing real data is) they are a fairly accurate method of capturing different types of interventions.

    As a maths teacher myself, I think your obsessive railing against effect sizes is particularly unhelpful. I think most of us want to see the profession improve and any research on that is helpful. Your concern with effect sizes doesn’t seem to go past moaning about them. Even when directly asked for an alternative by a member of EEF staff you simply attacked his personal experience and didn’t engage in what could have been a positive dialogue.

    Most of your claims are consistently backed up by claims like “no mathematicians are talking about this”, which would ring far truer if you were an academic rather than a tutor (I don’t imagine the kids you get to talk about maths have much opinion on effect sizes). When I was doing my PhD, most maths academics were open and collaborative with social scientists. Indeed, many on the programme did not come from pure mathematics backgrounds, and many empirical politics courses involve a large focus on statistics.

    Your blog does seem to attract the attention of teachers. I really hope that you can use it to put out a positive message or a constructive suggestion on improving education, rather than complaints about the status quo.

    Kind regards,


    • No Mathematician has been involved with the creation of the ‘Effect Size’. No Mathematician uses the ‘Effect Size’. No Mathematician has heard of the ‘Effect Size’. It is a mistake created by Psychologists then copied by Educationalists. Surely as a Mathematician you realise that you can’t just go around making up your own version of Maths. On what are you basing your idea that it is an accurate method?

  5. Puzzled by this post. “Effect size” is not a single thing, it’s a class of statistics which mean very different things (Pearson’s r is massively different to Hedges’ g, for instance; and both are different to Shepherd’s \pi, which amusingly really is an effect size).

    The two effect sizes you talk about have different names: Cohen’s d (the between groups comparison) and Cohen’s d_z (the within groups comparison). Anyone who says something like “the effect size is xx” has either misunderstood or is trying to simplify things in an unhelpful way.

    [Actually, it’s not totally clear whether your second effect size is d_z, as it depends on how you’ve calculated the standard deviation. It could be d_z, d_av or d_rm, each of which is a different effect size].
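To make the difference between d_z, d_av and d_rm concrete, here is a minimal sketch. The formulas are my understanding of the definitions in Lakens’s guide and should be checked against it; the before and after scores are invented, and the function names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def pearson_r(x, y):
    """Sample Pearson correlation between paired scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))

def within_group_effect_sizes(pre, post):
    """Return (d_z, d_av, d_rm) for paired pre/post scores."""
    diffs = [b - a for a, b in zip(pre, post)]
    m_diff = mean(diffs)
    sd_pre, sd_post = stdev(pre), stdev(post)
    r = pearson_r(pre, post)
    # d_z: mean difference over the SD of the difference scores.
    d_z = m_diff / stdev(diffs)
    # d_av: mean difference over the average of the two SDs.
    d_av = m_diff / ((sd_pre + sd_post) / 2)
    # d_rm: d_z rescaled to correct for the pre/post correlation.
    sd_diff = sqrt(sd_pre**2 + sd_post**2 - 2 * r * sd_pre * sd_post)
    d_rm = (m_diff / sd_diff) * sqrt(2 * (1 - r))
    return d_z, d_av, d_rm

# Hypothetical scores for the same five pupils before and after.
pre = [40, 45, 50, 55, 60]
post = [46, 55, 56, 66, 67]
d_z, d_av, d_rm = within_group_effect_sizes(pre, post)
print(d_z, d_av, d_rm)
```

On these invented scores d_z comes out several times larger than d_av and d_rm, because the pupils’ before and after scores are highly correlated, so the three cannot sensibly be read on one scale.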

    The important question is whether it’s ok, as you say Hattie does, to evaluate d and d_z on the same scale. It isn’t. See Daniel Lakens’s guide:

    • Mathematicians don’t call anything effect sizes, only Psychologists do. In Education there is a thing called the ‘Effect Size’ which is generally taken to be Cohen’s d. There has not been any distinction made between the two different ways (within/between groups) to calculate it which was the point of my post. There are different ways to calculate the standard deviation but they’re all pretty irrelevant to my overall theme, that, whichever version you use, the ‘Effect Size’ is incorrect.

      • Come on, it’s not “incorrect” to use a d. It’s just changing the units of a measurement from some real-world unit to a multiple of a SD. Think of it as analogous to measuring an angle as a degree or a radian. Whether or not changing the units is a useful thing to do can be the subject of an interesting discussion: it might be misleading or pointless, but it’s certainly not “incorrect” to do it.

  6. Also, here’s my response to James Richardson’s question. He asked “If there is an alternative to Effect Sizes that will allow us to compare the impact of educational interventions then please do explain the methodology and its benefits.”

    Thom Baguley advocates, very persuasively I think, that in almost all situations unstandardised effect sizes (i.e. just a difference in means, or a regression coefficient) are preferable to standardised effect sizes such as Cohen’s d or similar. As he remarks:
    “It should never be assumed that the mere act of adopting a (superficially) standard metric makes comparisons legitimate (Morris & DeShon, 2002; Bond et al., 2003). It should also be remembered that there is nothing magical about the standardization process: it will not create meaningful comparisons when the original units are themselves not meaningful (Tukey, 1969).”
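A tiny sketch of why standardising is not magic: the same raw difference in means turns into different standardised values depending only on how spread out the scores are. The 5-point gain and both standard deviations are hypothetical.

```python
def standardised_difference(raw_difference, sd):
    """Express a raw difference in means as a multiple of the SD."""
    return raw_difference / sd

raw_gain = 5.0  # hypothetical gain in test points, identical in both cases

d_wide = standardised_difference(raw_gain, sd=10.0)   # 0.5
d_narrow = standardised_difference(raw_gain, sd=5.0)  # 1.0
```

The pupils gained the same 5 points in both cases, yet the standardised figure doubles when the scores are less spread out, for instance with a narrower test or a more selective sample.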

    See his detailed arguments here:

  7. Pingback: A Statistical Battleground | docendo discimus
