The ‘Effect Size’ is not a recognised mathematical technique

Three things you should know about the ‘Effect Size’

1.   Mathematicians don’t use it.

2.   Mathematics textbooks don’t teach it.

3.   Statistical packages don’t calculate it.

Despite a public challenge in March 2013, none of the advocates of the ‘Effect Size’ have been able to name a Mathematician, Mathematics textbook or Statistical package that uses it. They are welcome to correct this in the comments below.

John Hattie admits that half of the Statistics in Visible Learning are wrong

At the researchED conference in September 2013, Professor Robert Coe, Professor of Education at Durham University, said that John Hattie’s book, ‘Visible Learning’, is “riddled with errors”. But what are some of those errors?

The biggest mistake Hattie makes is with the CLE (Common Language Effect size) statistic that he uses throughout the book. In ‘Visible Learning’, Hattie only uses two statistics, the ‘Effect Size’ and the CLE (neither of which Mathematicians use).

The CLE is meant to be a probability, yet Hattie has it at values between -49% and 219%. Now, a probability can’t be negative or more than 100%, as any Year 7 will tell you.
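To see why values like -49% or 219% cannot be right, here is a minimal sketch of the CLE as McGraw and Wong describe it for two normally distributed groups with the same spread: the probability that a randomly chosen score from one group beats a randomly chosen score from the other. Under those assumptions it works out as Φ(d/√2), where d is the Effect Size. The function names and the example values of d are assumptions chosen for illustration, not taken from the book.

```python
from math import erf, sqrt


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def cle_from_d(d: float) -> float:
    """CLE for two normal groups with equal standard deviations.

    The difference of two independent scores has standard deviation
    sqrt(2) * sd, so P(score A > score B) = Phi(d / sqrt(2)).
    """
    return normal_cdf(d / sqrt(2.0))


# Illustrative Effect Sizes only; the result is always a valid probability.
for d in (-3.0, 0.0, 0.40, 1.44, 3.0):
    print(f"d = {d:+.2f}  ->  CLE = {cle_from_d(d):.1%}")
```

Whatever value of d goes in, the result stays strictly between 0% and 100%, which is the check the figures in the book fail.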

This was first spotted by Arne Kare Topphol, an Associate Professor at the University of Volda, and his class, who pointed it out to Hattie in an email.

In his first reply (here), Hattie completely misses the point about the probabilities being negative and claims he used a different version of the CLE from the one he referenced (by McGraw and Wong). This makes his academic referencing, hmm, the word I’m going to use here is ‘interesting’.

In his second reply (here), Hattie reluctantly acknowledges that the CLE has in fact been calculated incorrectly throughout the book, but brushes it off as no big deal that, of the two statistics in the book, he has calculated one incorrectly.

There are several worrying aspects to this –

Firstly, it took 3 years for the mistake to be noticed, and it’s not as though it’s a subtle statistical error that only a Mathematician would spot. He has probabilities as negative, for goodness’ sake. Presumably, the entire Educational Research community read the book when it came out and they all completely missed it. So, the question must be asked, who is checking John Hattie’s work? As a Bachelor of Arts, is he capable of spotting Mathematical errors himself?

In Mathematics, new or unproven work is handed over to unbiased judges who go through it with a fine-tooth comb before it is considered to have the stamp of approval of the Mathematical community. Who is performing this function for the Educational community?

Secondly, despite the fact that John Hattie has presumably known about this error since last year, there has been no publicity telling people that part of the book is wrong and should not be used. Surely he could have found time between flying round the world to his many Visible Learning conferences to squeeze in a quick announcement.

As the stepfather of one of the letter writers, a Professor of Statistics, said –

“People who don’t know that Probability can’t be negative, shouldn’t write books on Statistics”

Sources –

Book review – Visible Learning by @twistedsq

Can we trust educational research? – (“Visible Learning”: Problems with the evidence)

EDIT – Since this post we have also discovered why the CLEs are all wrong and the reason is shocking. Read about it here – John Hattie admits that half of the Statistics in Visible Learning are wrong (Part 2).

The Age effect which means the ‘Effect Size’ is useless

In 2007, four American researchers looked at the data from seven national tests in Reading and six national tests in Maths across an age range from six to seventeen. They were looking for patterns in the Effect Sizes.

Empirical Benchmarks for Interpreting Effect Sizes in Research by Hill, Bloom, Black and Lipsey (2007)

Image

As we can see there is a clear downward trend and the hinge figure of 0.40 is never achieved again after the age of 10.

Image

Again there is a downward trend and the figure of 0.40 is never achieved after the age of 11. The authors of the paper also found the same trend when they studied national test results for Social Studies and Science.

This means that Hattie’s hinge figure of 0.40 is spectacularly misleading. Educational research done in Primary schools will usually do better than 0.40, whereas Teachers teaching in Secondary Schools will find that their Effect Size is usually below 0.40 and gets worse the older the children are, no matter how effectively they are teaching.

To get any kind of fair comparison for educational studies, we need to know the age of the children studied, as well as their results. We can then compare fairly with the typical Effect Size for their age range, instead of a headline figure of 0.40.

One possible reason that we are seeing this pattern is that the ‘Effect Size’ is really (inversely) measuring how spread out the pupils are, not how well they are progressing.

In Year 1, there’s not as big a difference between the top and the bottom child, because even the quickest child hasn’t learned that much. This means the standard deviation (how spread out the pupils are) is small. When you divide by something small you get a big number.

In Year 11, the opposite is true, there is a large difference between the top pupils and the bottom pupils. A big spread means a large standard deviation and dividing by a large number gives you a small number.
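The arithmetic behind this can be sketched in a few lines. The numbers below are invented purely for illustration (they are not from Hill, Bloom, Black and Lipsey): both year groups make exactly the same raw gain on a test, and only the spread of the pupils differs.

```python
def effect_size(mean_gain: float, standard_deviation: float) -> float:
    """Effect Size as a raw gain divided by the spread of the pupils."""
    return mean_gain / standard_deviation


# Hypothetical figures: the same 5-mark gain in both year groups.
raw_gain = 5.0
year_1 = effect_size(raw_gain, standard_deviation=8.0)    # pupils bunched together
year_11 = effect_size(raw_gain, standard_deviation=25.0)  # pupils widely spread

print(f"Year 1 Effect Size:  {year_1:.3f}")
print(f"Year 11 Effect Size: {year_11:.3f}")
```

With these made-up figures the identical gain lands above 0.40 in Year 1 and below it in Year 11, simply because the divisor has changed.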

Hat Tip to @dylanwiliam

How did the inventor of the Effect Size use it? (Not the way Hattie does.)

In 1969, Psychologist Jacob Cohen released his book ‘Statistical Power Analysis for the Behavioral Sciences’. In this book Jacob Cohen introduced the Effect Size for the first time and explained how to use it.

So, how did Jacob Cohen, the inventor of the Effect Size, use it?

Image

Quick translation – I noticed that people in the Behavioral Sciences sometimes did badly designed experiments because they didn’t understand Statistics well enough, so, I decided to help them by making some easy look-up tables.

Image

Quick translation – There are four ways to do Power Analysis, but two of them are rarely needed. The two main checks to make before you run your experiment are, firstly, that the Statistical Power is high enough and, secondly, that you have planned to test enough people.

Image

Image

Quick translation – To use the Statistical Power tables, you need to know the number of people in your experiment, the Statistical Significance you want and the Effect Size.

And here is a Statistical Power table from Jacob Cohen’s book, notice the Effect Size (d) at the top. There are dozens of pages of these tables in his book.

Image

And here he gives an example of how to use the Statistical Power tables.

Image

The other thing you need to check is the Sample Size.

Image

Quick translation – The other way to check your experiment is with the Sample Size table. To use this you need the Statistical Power, the Statistical Significance and the Effect Size.

And here is a Sample Size table, notice the Effect Size (d) at the top. Again there are dozens of pages of these tables in the book.

Image

And he gives an example of how to use the Sample Size table.

Image

Now, every modern user of the Effect Size cites Cohen and they always quote him about small, medium and large effects. This gives the impression that they are just continuing his work, yet, they are using it in a completely different way to him.

Jacob Cohen, the inventor of the Effect Size, used it to check the Statistical Power and the Sample Size of an experiment before the experiment was run. He did this using look-up tables.
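For readers who want to see what such a look-up involves, here is a rough sketch of the two checks for a two-sided comparison of two equal-sized groups. It uses a normal approximation rather than Cohen’s own tables, so its answers may differ slightly from the printed ones. The function names and default values are assumptions for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def sample_size_per_group(d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate people needed in EACH group to detect Effect Size d."""
    z_alpha = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)  # two-sided significance
    z_power = _STD_NORMAL.inv_cdf(power)
    return ceil(2.0 * ((z_alpha + z_power) / d) ** 2)


def approximate_power(d: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate Statistical Power for a planned group size."""
    z_alpha = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return _STD_NORMAL.cdf(abs(d) * sqrt(n_per_group / 2.0) - z_alpha)


# Check an experiment BEFORE running it, for small, medium and large effects.
for d in (0.2, 0.5, 0.8):
    n = sample_size_per_group(d)
    print(f"d = {d}: about {n} per group, power = {approximate_power(d, n):.2f}")
```

In both functions the Effect Size is an input you choose in advance when planning the experiment, not a result you report afterwards.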

A closer look at Hattie’s top two Effect Sizes

Hattie’s top two Effect Sizes in Visible Learning are

Self-reported grades – 1.44

Piagetian programmes – 1.28

In fact, these are the only two that are above 1.0.

Kristen DiCerbo has had a closer look at Self-reported grades. Hat-tip to @Mrsdaedalus.

Piagetian stages were proposed by Jean Piaget. Basically, as children develop they pass through various stages of development, firstly with motor skills as babies, then thinking skills as young children.

Piagetian programmes cites only one meta-analysis, Jordan and Brownlee (1981). Unfortunately, I can’t find the full paper, only an abstract. The abstract does show two things though.

Firstly, the original studies weren’t calculated as Effect Sizes, they were calculated as correlations. Hattie has again converted correlation coefficients into Effect Sizes. The study is basically saying that kids who develop faster when they are babies (because they are more intelligent) do better at tests a few years later (because they are more intelligent). Hardly earth-shattering stuff. And the same as the Self-reported grades, this is not an intervention, there’s nothing you can do about it, it’s just a correlation.
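To give a feel for what such a conversion does, here is a sketch using a commonly quoted formula for turning a correlation coefficient r into an Effect Size d, namely d = 2r / √(1 − r²). Whether Hattie used exactly this formula is an assumption on my part, and the values of r below are illustrative, not taken from Jordan and Brownlee.

```python
from math import sqrt


def r_to_d(r: float) -> float:
    """Convert a correlation coefficient to an Effect Size: d = 2r / sqrt(1 - r^2)."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return 2.0 * r / sqrt(1.0 - r * r)


# Illustrative correlations only.
for r in (0.1, 0.2, 0.3, 0.5, 0.7):
    print(f"r = {r:.1f}  ->  d = {r_to_d(r):.2f}")
```

Under this formula even a modest correlation turns into a healthy-looking Effect Size, without any intervention having taken place.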

Secondly, the students in the study had an average age of just 7 years old. Hattie has used this to extrapolate to all students aged 5 to 18. We teach pupils not to extrapolate outside the data range at GCSE.

Remember that both of these Effect Sizes were used when Hattie calculated his 0.40 average, so, if they are wrong, then so is the 0.40 hinge point. And we could have included any number of correlations in here and changed them to Effect Sizes. It just shows that his 0.40 ‘hinge point’ is completely arbitrary.

Also, it may be worth pointing out at this point the differences between the correlation coefficient and the Effect Size.

The correlation coefficient – Developed in the 1890s by Karl Pearson, building on Francis Galton’s work in the 1880s. Pearson is considered by many to be the Father of Mathematical Statistics and founded the first University Statistics department. Explanation of the reasoning behind it and derivation using Algebra in every Statistics textbook. Learnt by Mathematicians either at A Level or first year of University.

The Effect Size – Introduced in 1969 by Jacob Cohen, a Psychologist, for Power Analysis, and repurposed in the 1970s by Gene Glass, an Educational Psychologist. No explanation or derivation ever given even today. Appears in no Maths textbooks. No Mathematician has ever heard of it.

The two kinds of Effect Size

At the start of ‘Visible Learning’, John Hattie talks about the two different ways to calculate the Effect Size.

Effect Size = (Mean of group at end of intervention – Mean of group at start of intervention) / Standard Deviation

and

Effect Size = (Mean of intervention group – Mean of control group) / Standard Deviation

Now, obviously, real Mathematical things don’t have two different definitions, because of the confusion this causes, as we shall see.

The problem is that the first definition measures actual improvement whereas the second measures relative improvement.
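A small sketch makes the difference concrete. The test scores below are invented for illustration: one class receives an intervention, a control class does not, and both are assumed to share a standard deviation of 10 marks.

```python
def pre_post_effect_size(mean_start: float, mean_end: float, sd: float) -> float:
    """First definition: improvement of one group over time."""
    return (mean_end - mean_start) / sd


def control_group_effect_size(mean_intervention: float, mean_control: float, sd: float) -> float:
    """Second definition: intervention group compared with a control group."""
    return (mean_intervention - mean_control) / sd


# Hypothetical scores: both classes start at 50 marks.
start_mean = 50.0
intervention_end_mean = 58.0
control_end_mean = 55.0  # the control class also improves over the year
sd = 10.0

actual = pre_post_effect_size(start_mean, intervention_end_mean, sd)
relative = control_group_effect_size(intervention_end_mean, control_end_mean, sd)

print(f"First definition (actual improvement):    {actual:.2f}")
print(f"Second definition (relative improvement): {relative:.2f}")
```

With these made-up scores the same intervention in the same class comes out well above 0.40 on the first definition and below 0.40 on the second.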

To give an example, imagine Ann and Bob are both travelling to an education conference. They set off at the same time, driving their cars down the motorway.

We know that Ann drives at an average speed of 40 mph. How can we tell if Bob will get there first?

We could give his actual speed of 50 mph.

Or we could give his relative speed, he’s travelling 10 mph faster than Ann.

It doesn’t matter which one we use as long as we all know which definition we’re talking about.

The problem comes when we start using the same words to mean two different things and start mixing them up.

When comparing Bob’s speed to Ann’s, a good actual speed would be anything over 40 mph, but a good relative speed would be anything over 0 mph.

Now, if I say that Cathy is going at a speed of 30 mph, is that an actual speed, in which case she’s going the slowest, or a relative speed in which case she’s travelling the fastest?

Hattie only mentions the two different types of Effect Size once, at the start of the book, but the way he talks later on (“Everything works” and “We need to compare with 0.40”) shows that the definition he is using is the first one, measuring actual improvement. However, did he make that distinction when he was choosing his studies? I suspect that he has not realised that the two different ways of measuring would produce very different answers and he has just thrown both types of study in together.

For John Hattie, any Effect Size bigger than 0.40 would be ‘good’.

Now which version does the Education Endowment Foundation use?

In their recent Chatterbooks study they say

http://educationendowmentfoundation.org.uk/projects/chatterbooks/

Image

Image

This and the fact that they use control groups show that they are using the second way of calculating the Effect Size, the relative way.

So, for the Education Endowment Foundation, anything better than zero would be ‘good’.

So, to sum up, the two major players using the ‘Effect Size’, John Hattie and the Education Endowment Foundation, are actually using it to mean two completely different calculations, one actual and one relative. For one of them, anything above 0.40 would be ‘good’; for the other, anything above zero would be ‘good’.

Correspondence with John Hattie

Shortly after I started this blog, after I’d done a few posts, I wrote to John Hattie at the University of Melbourne pointing out some of my concerns. One of the things I pointed out was that he claimed the ‘Effect Size’ had units of standard deviation when it can be shown mathematically that it actually has no units (and it’s fine for it to have no units as long as you realise that).
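The ‘no units’ point can be checked numerically: if the Effect Size had units, changing the units of the measurements would change its value. In the sketch below the heights are invented, and dividing by the standard deviation of the control group is an assumption made to keep the example short.

```python
from statistics import mean, stdev


def effect_size(intervention: list[float], control: list[float]) -> float:
    """(Mean of intervention group - mean of control group) / SD of control group."""
    return (mean(intervention) - mean(control)) / stdev(control)


# Hypothetical measurements in centimetres.
control_cm = [150.0, 155.0, 160.0, 165.0, 170.0]
intervention_cm = [156.0, 160.0, 166.0, 171.0, 177.0]

# The same measurements converted to inches.
control_in = [x / 2.54 for x in control_cm]
intervention_in = [x / 2.54 for x in intervention_cm]

d_cm = effect_size(intervention_cm, control_cm)
d_in = effect_size(intervention_in, control_in)

print(f"Effect Size from centimetres: {d_cm:.4f}")
print(f"Effect Size from inches:      {d_in:.4f}")
```

The two results agree (up to floating-point rounding) because the units on the top and bottom of the fraction cancel, which is what having no units means.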

In fairness to him, he wrote back quite a long letter taking each of my points in turn. When it came to my ‘the Effect size has no units’ point he said –

“It is not correct to claim that the Effect Size has no units, it does, from -infinity to +infinity but more normally between -3 and 3”

Now, up to this point, I couldn’t quite believe all that I’d found out about the Effect Size. I would say to myself ‘the Effect Size is wrong and you’re the only one who noticed. Yeah, right!’ I was constantly searching my mind thinking ‘You’ve missed something, what have you missed?’

When I read this statement from him my mouth just dropped open.

Not only does John Hattie not know what units the ‘Effect Size’ is measured in, he doesn’t even understand what units are. What he’s quoted are not the units but the typical magnitude of the ‘Effect Size’ as found in Education research. This is an error which throws doubt on John Hattie’s basic mathematical competence.

To give you an example of how big a gaffe this is, imagine you asked a Physics Teacher what the units of speed for a car are. ‘The units of speed of a car are between 0 and 70’ they answer. No, the units of speed are miles per hour (or kilometres per hour or metres per second). That is a significant mistake and you wouldn’t have a great deal of faith in the ability of the person who said it afterwards.

In his letter, John Hattie also admonished me, saying that he had read my blog and felt I made too many remarks about him for him to leave a comment. He said –

“. . . in Academia the criticism is of ideas not people”

Which is fine, except most people have no way to gauge whether a Mathematical argument is correct, so they might need to rely on other questions to guide them, questions like –

– Do other people in relevant fields use this?

– What is the competence of the person using this?

Now these questions won’t give us the definite answer to the use of the ‘Effect Size’ but what they may do is indicate an area of concern that may be worthy of further investigation.

The answer to the first question is that Mathematicians and Scientists have never heard of the Effect Size; in fact, only Psychologists and Education Researchers use it.

If you’re going to use Maths that Mathematicians don’t, you’re either a genius or you don’t know what you’re doing.

John Hattie is an Arts Graduate who doesn’t understand what units are, nor the importance of getting them correct. I’ll leave you to ponder for yourself which he is.

Why the EEF report on Philosophy for children has no tests for statistical significance

I noticed a few days ago that people were expressing surprise on Twitter that the EEF report on Philosophy for children had no tests for statistical significance.

The problems with the statistics have been written about in greater detail here and also here.

OK. Maybe some people haven’t read my previous blogs or believed what I’ve said before (and it is quite shocking) so I will briefly explain again.

When Mathematicians invented modern-day Statistics in the 1930s, they needed a way to see if results from an experiment were a real effect or just randomness. (For example, I throw a coin 10 times and it comes up Heads 7 times, it’s probably just randomness. I throw a coin 100 times and it comes up Heads 70 times, it’s probably biased.) So, Mathematicians invented statistical significance and p values to separate randomness and real effects.
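The coin example can be worked through exactly. The sketch below computes the one-sided p value, that is, the chance of seeing at least that many Heads if the coin is fair. The function name is an assumption for illustration.

```python
from math import comb


def p_at_least(heads: int, tosses: int) -> float:
    """Probability of getting `heads` or more Heads from a fair coin."""
    favourable = sum(comb(tosses, k) for k in range(heads, tosses + 1))
    return favourable / 2 ** tosses


p_small = p_at_least(7, 10)     # 7 Heads out of 10
p_large = p_at_least(70, 100)   # 70 Heads out of 100

print(f"P(at least 7 Heads in 10 tosses)   = {p_small:.4f}")
print(f"P(at least 70 Heads in 100 tosses) = {p_large:.6f}")
```

Seven or more Heads in ten tosses happens about 17% of the time with a fair coin, so it is unremarkable. Seventy or more in a hundred happens less than once in ten thousand tries, which is why we would suspect the coin.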

Now, along come some Psychologists. They said “Mathematicians are a bunch of idiots and they’re doing this all wrong, let’s invent our own way of doing things”. So they invented the Effect Size. Mathematicians and Scientists have continued using statistical significance and Psychologists and Educationalists have continued using the Effect Size. They have said repeatedly that Null Hypothesis Significance Testing (i.e. the way Mathematicians and Scientists do things) is wrong.

Image

Statistics Hell

This kind of thing is repeated on numerous Social Science websites.

So, you’ve really got to understand this, it’s not a case of them choosing one technique over another.

The people who use the Effect Size think that statistical significance testing, i.e. the way Mathematicians and Scientists do things, is wrong, and they have invented their own way of doing Statistics.

You’ve really got to grasp that to understand what I’ve been saying in my blogs.

How could all those people be wrong?

“How could thousands of Psychologists and Educationalists all make the same mistake? Entire fields doing incorrect Statistics. It’s simply not plausible.”

On Thursday night I read a piece called ‘The Art of being Right’ by Arthur Schopenhauer. Below I reproduce a few paragraphs from a section entitled ‘Appeal to Authority rather than Reason’.

“When we come to look into the matter, so-called universal opinion is the opinion of two or three people; and we should be persuaded of this if we could see the way in which it really arises.

We should find that it is two or three persons who, in the first instance, accepted it, or advanced it and maintained it; and of whom people were so good as to believe they had thoroughly tested it. Then a few other persons, persuaded beforehand that the first were men of the requisite capacity, also accepted the opinion. These, again, were trusted by many others, whose laziness suggested to them that it was better to believe at once, than to go through the troublesome task of testing the matter for themselves. Thus the number of these lazy and credulous adherents grew from day to day; for the opinion had no sooner obtained a fair measure of support than its further supporters attributed this to the fact that the opinion could only have obtained it by the cogency of its arguments. The remainder were then compelled to grant what was universally granted, so as not to pass for unruly persons who resisted opinions which everyone accepted. 

Since this is what happens, where is the value of the opinion even of a hundred millions? It is no more established than a historical fact reported by a hundred chroniclers who can be proved to have plagiarised it from one another; the opinion in the end being traceable to a single individual.”

Gene Glass should be the most famous man in Education. He is the person who changed the way the ‘Effect Size’ is used and spread its new use throughout Education.

He became an Educational Psychologist in 1964. In the early Seventies he was receiving Psychotherapy and decided it had helped him so much that he wanted to prove to everyone that Psychotherapy worked. He’d learned about the ‘Effect Size’ from Jacob Cohen’s book ‘Statistical Power Analysis for the Behavioral Sciences’. (Jacob Cohen originally invented the ‘Effect Size’ and wrote a 500 page book explaining how to correctly use it to find the number of people you needed for your experiment.) Glass decided to completely change the way Jacob Cohen used the ‘Effect Size’, throw away the carefully constructed statistical look-up tables and use it for a completely different reason, sticking results together.

While he was doing this, Glass was also elected as the President of the American Educational Research Association. He used his Presidential address to 1,500 educational researchers to announce his new method of putting results together using the new way of using the ‘Effect Size’. How many of those researchers would have thought that there was any element of doubt in what this eminent man was telling them at this prestigious occasion? How many of them would have had the necessary expertise to tell if it was correct or not?

Glass wrote a 2 page pamphlet justifying his new way (this has a few sketches on it as proof) and published an article with his wife, Mary Lee Smith, in ‘American Psychologist’. Psychologists and Educationalists all started to copy him and the new method spread throughout Psychology and Education.

So, imagine all the children of the world. Underneath them, supporting them, are the teachers from all the different countries. Underneath them is the whole of education research, and all of this rests on the shoulders of just one man, Gene Glass. Given that Mathematicians have never taken the remotest bit of interest in the ‘Effect Size’, are we absolutely sure he’s correct?