Now the writer of the EEF report on Philosophy says that the way Mathematicians and Scientists do Statistics is all wrong

The writer of the EEF report on Philosophy, Professor Stephen Gorard, has now openly admitted that he thinks the way Mathematicians and Scientists do Statistics is wrong and should be banned.

The significance test is how Mathematicians and Scientists do Statistics.

Psychologists invented the Effect Size as the “New Statistics” to replace it. It is unknown to Mathematicians.



When the Physicists at the Large Hadron Collider were looking for the Higgs Boson particle, to be sure they had really found it they used a significance test, called the five sigma test.
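As a rough illustration of what "five sigma" means as a significance threshold, the sketch below converts a number of standard deviations into a tail probability. It assumes a normally distributed test statistic and the one-sided convention; the function name `one_sided_p_value` is illustrative and does not come from the post.

```python
import math


def one_sided_p_value(n_sigma: float) -> float:
    """Upper-tail probability of a standard normal beyond n_sigma."""
    # erfc gives the two-sided tail of the normal; halve it for one side.
    return 0.5 * math.erfc(n_sigma / math.sqrt(2))


# Probability of a fluctuation at least this large if there is no real signal
# (under the normality assumption stated above).
p_five_sigma = one_sided_p_value(5.0)
print(f"five sigma -> p = {p_five_sigma:.2e}")        # about 2.87e-07
print(f"roughly 1 in {round(1 / p_five_sigma):,}")    # about 1 in 3.5 million
```

For comparison, the conventional p < 0.05 threshold corresponds to a little under two sigma on the same one-sided scale.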


So, on one side of the argument we have the people who found the Higgs Boson, on the other, Stephen Gorard. The decision is yours.


15 thoughts on “Now the writer of the EEF report on Philosophy says that the way Mathematicians and Scientists do Statistics is all wrong”

  1. This is a strange argument. Effect sizes are not unknown to anyone who uses significance testing and knows what they’re doing: Ns, ps and ESs are all functions of one another.

    • I’m willing to bet you work in Education or Psychology, Luke, the only fields that use the Effect Size. I did a degree in Maths, half of which was Statistics, and never heard of it.

  2. I’ve been following this one, and interested to read your views. I wonder if you would be willing to expand on them a little. I saw you debating Rob Coe at researchED 2013 and the argument you presented there was pretty much identical to the one you make above, in your post and in your response to the first comment: essentially, people who advocate use of effect sizes are not mathematicians so their arguments are not valid.

    That’s a fair place to begin a debate, I suppose. Who could complain about someone suggesting that statisticians are the people most likely to know what they’re talking about when it comes to statistics? But now, as then, I’m not really seeing you develop this line any further. You are making an appeal to authority, which, without further qualification is a logical fallacy. I would really like to see you deal with the substance of how Gorard has explained why he thinks statistical significance, p values, confidence intervals etc. are not appropriate in this case. He deals with it in more detail in his book Research Design (see esp. pp 55 and 179).

    You will of course understand why just asserting that ‘particle physicists say statistical significance is important in making claims about physical particles, so therefore educationalists are wrong to suggest it is not important for children in schools’ might not be viewed as a particularly robust basis for your position. Have I characterised your position correctly?

    I have no dog in this fight, I just want to understand the statistics better.

    • I understand what Stephen Gorard is saying, it’s not even original, it’s just a rehash of things Psychologists have been saying for years. He’s just wrong. You can look up the reasoning behind statistical significance and p values in any advanced Statistics text-book. The Statistics he’s doing is nothing special, it’s actually very basic stuff. Statistics is statistics, whether you’re doing particle Physics or educational research.
      One of the problems with persuading people is there’s nothing to attack. The normal way Maths works is you release a new idea with the reasoning behind it and other Mathematicians check it to see if it is correct. That didn’t happen in this case, one day a Psychologist called Gene Glass just decided it was a good idea to use the Effect Size for Education Research and they’ve used it ever since.

  3. Thanks for expanding. Much appreciated. On a related tangent, a problem with educational research is that it is difficult to synthesise the results of many trials on similar interventions in a meta-analysis, because so often investigators use different measures to assess outcomes. As I understand it, Gene Glass proposed effect sizes as a way to facilitate such synthesis. You have pointed out the flaw in effect sizes – apart from anything else people don’t really know what they mean. That said, from your point of view as a statistician, can you see any way around this problem?

    • It’s worth pointing out that even if you did need the Effect Size for Meta-Analysis you don’t need it for the one-off studies that Stephen Gorard is doing.
      I don’t think there is any way round it, if you’ve measured things in different ways then you can’t stick them together. The usual thing in medicine is a systematic review which lists all the relevant studies of a topic and judges their reliability but doesn’t try to combine all the data.

      • That’s helpful. Thank you. I agree that (because everyone uses different outcome measures) if we want to synthesise results of research in education then we may have to settle for an imperfect characterisation of effect, but that doesn’t mean we need to accept effect sizes as the only (or even main) measure of impact in a one-off trial.

        You may want to refamiliarise yourself with the methodology of systematic reviews in medicine though. The power of SRs like those prepared by the Cochrane Collaboration is precisely because they aggregate the results of many trials. This allows the natural ‘wobbling’ of data over time and space to be smoothed out somewhat and gives us an estimate of effect based on the totality of the data. Of course this is helped by medicine’s more homogeneous outcome measures, such as mortality.

      • I’m aware that the Systematic reviews have started to aggregate data, led by the Cochrane Collaboration. Again, if you look at the background of the people pushing it, it’s all Psychology and Psychiatry related. So, I think it’s spreading like a virus into Medicine as well unfortunately.

      • It was perinatal medicine that kicked off the Cochrane Collaboration in 1989 with a series of aggregative systematic reviews. So, I’m unclear where the idea that they have “started” to aggregate data comes from, nor the notion that this was led by psychologists and psychiatrists.

  4. It is ‘ollieorange’ who is wrong here (and not for the first time, as Hamish and others have noted). Is that why a lower-case pseudonym is required (an odd choice for someone who regularly checks the background of other people via the internet)? He does not comprehend ‘effect’ sizes (which is a misnomer as they are not really about effects of course, but an attempt to standardise differences). Standardising differences is important so that we can be clearer to readers – a change of 10cm may be large for a particle but not much for a braking distance.

    Elsewhere ‘ollieorange’ discusses the use of effect sizes by EEF. His example involves an intervention with two classes/groups. Both start with prior scores of 50%. The first group improves to 60% – which ‘ollieorange’ describes as a growth of ‘10%’, although it is actually a growth of 10 percentage points or 20%, whether described by a mathematician or an artist. The second group improves to 70%. So, ‘ollieorange’ claims that if the first group had a standard deviation (of the post or the gain unspecified) of 5, and the second group a standard deviation of 20, the ‘effect’ size for the first would be 2, and for the second group 1. It would appear that the first group did better, according to ‘ollieorange’ and his weird approach to arithmetic.

    Of course, this is just wrong (but perhaps told like that as deliberate casuistry since surely even a media studies graduate would not make such a simple mistake). We are not interested in a before-and-after design. Here we have comparator groups. The second group made the greater improvement. That is set up at the outset by ‘ollieorange’. The standardisation to ‘effect’ size does not alter that. It merely looks at that relative improvement for the second group as a proportion of the scatter or variation in the scores.

    The correct way to portray the comparative result is to note the difference between the groups in terms of post-test scores, since the prior scores were well-balanced. This would be 10. Note that the same 10 would arise if instead we used the difference in gain or progress scores between the groups. To obtain the ‘effect’ size we divide the 10 difference in post-test scores by their overall standard deviation. We do not know this key figure since ‘ollieorange’ does not state it for his example. Let’s imagine it is 10 (somewhere between the 5 and 20 for the respective groups). This would make the relative improvement of group 2 compared to group 1 have an ‘effect’ size of 1. It’s simple, easy to derive, and to comprehend. It does have some important limitations, but confusing which group made the greater improvement is not one of those. That takes a special skill.
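To make the arithmetic in this exchange easier to follow, here is a minimal sketch of both calculations using only the figures quoted in the comment above. The function name `standardised_difference` is illustrative, and the overall standard deviation of 10 is the value imagined in the comment, not a measured figure.

```python
def standardised_difference(mean_a: float, mean_b: float, sd: float) -> float:
    """Difference between two means expressed in units of a standard deviation."""
    return (mean_b - mean_a) / sd


pre_score = 50.0                      # both groups start at 50%
post_group1, post_group2 = 60.0, 70.0

# Per-group version: each group's own gain divided by its own SD (5 and 20).
es_group1 = standardised_difference(pre_score, post_group1, 5.0)    # 2.0
es_group2 = standardised_difference(pre_score, post_group2, 20.0)   # 1.0

# Comparative version: difference in post-test scores between the groups,
# divided by an overall SD (assumed to be 10, as imagined in the comment).
es_comparative = standardised_difference(post_group1, post_group2, 10.0)  # 1.0

# Percentage points versus relative growth for the first group.
points_gain = post_group1 - pre_score                        # 10 percentage points
relative_growth = 100 * (post_group1 - pre_score) / pre_score  # 20 percent

print(es_group1, es_group2, es_comparative, points_gain, relative_growth)
```

The two versions answer different questions: the per-group figures rank group 1 above group 2 because its scores were less spread out, while the comparative figure describes how far group 2 finished ahead of group 1.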

    • Why not just use the 10% and the 20% to compare, even easier? No need to get the standard deviation involved at all.

      I was using one of the two definitions of the Effect Size given by Hattie on Page 8 of Visible Learning. I thought the whole point of the Effect Size was to be able to compare the improvement of different groups.

  5. And the comparative design does just that – compares the achievement of different groups. But my version does it sensibly. ‘Effect’ sizes can be used for other things (including non-comparative-between-groups change over time). Like any approach they have limitations (see Gorard, S. (2006) Towards a judgement-based statistical analysis, British Journal of Sociology of Education, 27, 1, 67-80 for a major one). Generally they are just an attempt to standardise differences.

    Straightforward reporting of raw results is the same – clear but sometimes misleading. See, for example, Works well in your made-up example because both groups are identical at the outset.

    Actually I would prefer not to bring in the SD – just not for the reason you suggest. See

    And this is in no way a defence of the Hattie synthesis of studies without consideration of design, bias or attrition. See my review Gorard, S. (2009) Review of Visible learning, by John Hattie, Abingdon, Routledge, 2009, 378pp., £24.99 (paperback), ISBN 9780415476188, Cambridge Journal of Education, 39, 4, 528-531:

    Review of:
    Visible learning, by John Hattie, Abingdon, Routledge, 2009, 378pp., £24.99 (paperback), ISBN 978-0-415-47618-8

    Stephen Gorard, University of Birmingham

    This book is a synthesis of the results of over 800 meta-analyses examining factors and processes associated with pupil learning and attainment at school. Although dated 2009, it began to make newspaper headlines in late 2008. The findings have been reported to the DCSF in England, and have led to a number of critical and not so critical blogs worldwide. It is clearly an important book that should not be ignored. It consists of three introductory chapters (the problem, an explanation of techniques for meta-analysis, and the underlying argument of the book laid bare), seven substantive results chapters working from pupil and home background to variation in teaching approaches, and a final chapter (“bringing it all together”). There are also appendices with more details of the meta-analyses used, and the usual references and an index (rather too brief, I felt, to locate studies in particular areas). Hattie works in New Zealand, and some of the material is linked to his local context, but the book is generally international in remit.

    In each substantive chapter, the findings of each meta-analysis are presented in the usual table and text format, but also in a reasonably clear pictorial ‘barometer of influence’ depicting the scale of the ‘effect’ size. For example, pupil prior achievement has an effect size of 0.67 on subsequent attainment, family SES an effect size of 0.57, class size 0.21, teacher knowledge of their subject matter 0.09, and so on. Putting it all together, the author claims that what the synthesis does is to make teaching visible. The first major conclusion (p.238) is that ‘Teachers are among the most powerful influences in learning’. Apparently, teachers need to be actively engaged in personalised pupil learning, and to work in an environment where ‘error is welcomed as a learning opportunity’ (p.239). The synthesis also suggests that increased financial resources in themselves will not make a difference, nor will reduced class sizes, increased subject knowledge, improved school compositions in terms of pupils from different backgrounds, or whizzo new types of schools such as Academies and Charter schools, and so on.

    Some of these findings are welcome. It is especially important, and ethical, that we try to focus future research and development on areas where the potential improvements are huge, and will survive translation into practice and over time and place. But are these conclusions warranted? Hattie correctly argues for inclusiveness of material when conducting a synthesis of evidence (p.11). Rather than having artificial thresholds for the quality of studies to be included, or the dead hand of checklists, a meta-analysis can use all available evidence and weight it for relevance and quality. This is still subjective (what else could it be?) but at least it is not an all or nothing approach. Hattie is critical of a review of adult literacy by Torgerson et al. (2004) which rejected 4,526 studies out of 4,555 because they were not randomised controlled trials. But he does something similar by only including studies that yield a numeric answer, so that they can be meta-analysed. This outright rejection of the majority of education research that is not encapsulated by a measure is not fair; nor is it necessary. There are appropriate methods of synthesis that allow full inclusion – a Bayesian approach is one (Gorard with Taylor 2004). Hattie is also incorrect in writing off the Torgerson et al. approach so easily. What they wanted to do was draw causal inferences, and this must surely involve rigorous evaluation of mature innovations, which is all that randomised controlled trials are. When Hattie concludes that there is a huge ‘effect’ size of 1.28 for the Piagetian stage, or 0.88 for micro-teaching, when coupled with pupil attainment, he implies or even explicitly states that improving the stage or encouraging micro-teaching will enhance attainment. This kind of conclusion is not warranted by the passive datasets trawled by Hattie. 
    He reminds us at the outset that his method provides only associations (similar to the R-squared effect size of a correlation), however they might be described later in the book. He cannot, for example, say that class size has a limited effect, only that the normal range of variation (perhaps 16-40) is associated with only small differences in outcomes. The class size can be a consequence of (i.e. ‘pulled by’) the outcomes (such as when the easier to teach are in larger groups, or where independent schools have lower pupil-teacher ratios), and of course the data is silent on classes of 2 or 200.

    This meta-analysis of meta-analyses is useful and fun to read, but by design loses the underlying inherent variation, and tends to provide findings that are banal at first reading (e.g. teachers must have ‘understanding of their [teaching] content’, p.238). Precisely because it accepts a passive approach to research and development this synthesis can be seen as inherently conservative, accepting the twentieth century structure of schools worldwide and the power relations between learner and teacher through necessity. It also assumes that there is a standard gauge for effect sizes, which there clearly should not be (Gorard 2006), makes almost no attempt to link these effect size benefits to their costs (pp.256-257 only), and makes no attempt at all to link them to unintended and disadvantageous consequences. The book is also let down throughout by the use of devices like the standard error, which is derived from sampling theory and makes a number of assumptions about the random nature of sampling and the distribution of measurement errors. The meta-analyses here include samples that are not even vaguely random in nature, and whose distribution is unknown. Under these conditions the standard error is meaningless.

    Some of the material in the book that I am familiar with is simply incorrect, and this may, or may not, be indicative of how material unfamiliar to me is handled. As one example, on p.239, Hattie states that a study by Fiske and Ladd (2000) showed how a school choice scheme in New Zealand had increased ‘the disparity between top and bottom schools… dramatically’, and led to white flight from lower SES schools. In fact, the data from Fiske and Ladd showed something rather more positive. All of their key tables (such as 7-2 to 7-4) present data for the years 1991, 1996 and 1997, and show a slight increase in intake segregation from 1991 to 1996/97. However, 1990 was the last year before the reforms, whereas 1991 was the first year after the new policy and the only year in which contested school places were allocated by lottery. As Fiske and Ladd admit in a footnote on page 194, ‘the indexes fell substantially between 1990 and 1991’. By 1997, segregation in New Zealand was still substantially lower than in 1990. By not publishing the pre-choice levels of segregation, the analysis by Fiske and Ladd makes it look as though segregation increased after the extension of choice, whereas it actually fell substantially and then crept up again (following the abolition of the lottery) just as it did in the UK (Gorard 2009).

    What to conclude? This is and is going to be an important book – one that needs discussion and serious and more prolonged consideration than I have been able to do here. But it is not the kind of definitive overview of what works in teaching and learning at school that some commentators have suggested. It is a useful and innovative addition to a much bigger picture of evidence, which is how the author himself positions it.


    Fiske, E. and Ladd, H. (2000) When schools compete: a cautionary tale, Washington DC: Brookings Institution Press
    Gorard, S. (2006) Towards a judgement-based statistical analysis, British Journal of Sociology of Education, 27, 1, 67-80
    Gorard, S. (2009) Does the index of segregation matter? The composition of secondary schools in England since 1996, British Educational Research Journal, (forthcoming)
    Gorard, S., with Taylor, C. (2004) Combining methods in educational and social research, London: Open University Press
    Torgerson, C., Brooks, G., Porthouse, J., Burton, M., Robinson, A., Wright, K. and Watt, I. (2004) Adult literacy and numeracy interventions and outcomes: a review of controlled trials, London: National Research and Development Centre for Adult Literacy and Numeracy

  6. Pingback: Psychology journal bans significance testing | ollieorange2
