# The Age effect which means the ‘Effect Size’ is useless

In 2007, four American researchers looked at the data from seven national tests in Reading and six national tests in Maths across an age range from six to seventeen. They were looking for patterns in the Effect Sizes.

*Empirical Benchmarks for Interpreting Effect Sizes in Research*, by Hill, Bloom, Black and Lipsey (2007)

In the Reading results there is a clear downward trend, and the hinge figure of 0.40 is never reached again after the age of 10.

The Maths results show the same downward trend, and the figure of 0.40 is never reached after the age of 11. The authors of the paper found the same trend when they studied national test results for Social Studies and Science.

This means that Hattie’s hinge figure of 0.40 is spectacularly misleading. Educational research done in primary schools will usually beat 0.40, whereas teachers in secondary schools will find that their Effect Size is usually below 0.40, and falls further the older the children are, no matter how effectively they are teaching.

To get any kind of fair comparison for educational studies, we need to know the age of the children studied, as well as their results. We can then compare fairly with the typical Effect Size for their age range, instead of a headline figure of 0.40.
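As a rough sketch of what such an age-adjusted comparison could look like, here is a minimal Python example. The benchmark values in the table are illustrative placeholders in the spirit of the pattern Hill et al. found (higher expected growth for younger children), not figures taken from the paper:

```python
# Hypothetical age-typical annual Effect Sizes, falling with age.
# These numbers are illustrative only, not taken from Hill et al.
AGE_BENCHMARKS = {
    6: 1.00, 7: 0.90, 8: 0.60, 9: 0.55, 10: 0.45, 11: 0.40,
    12: 0.35, 13: 0.30, 14: 0.25, 15: 0.22, 16: 0.20, 17: 0.15,
}

def judge(effect_size, age):
    """Compare a study's Effect Size to the typical annual growth
    for pupils of that age, instead of to a flat 0.40 hinge."""
    benchmark = AGE_BENCHMARKS[age]
    verdict = "above" if effect_size > benchmark else "at or below"
    return f"d = {effect_size:.2f} is {verdict} typical growth ({benchmark:.2f}) at age {age}"

# The same d = 0.35 looks weak for 7-year-olds but strong for 15-year-olds.
print(judge(0.35, 7))
print(judge(0.35, 15))
```

Against a flat 0.40 hinge, both of these hypothetical studies would be written off as ‘below average’, which is exactly the distortion described above.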

One possible reason that we are seeing this pattern is that the ‘Effect Size’ is really (inversely) measuring how spread out the pupils are, not how well they are progressing.

In Year 1, there’s not as big a difference between the top child and the bottom child, because even the quickest child hasn’t learned that much yet. This means the standard deviation (a measure of how spread out the pupils are) is small, and when you divide by something small you get a big number.

In Year 11, the opposite is true: there is a large difference between the top pupils and the bottom pupils. A big spread means a large standard deviation, and dividing by a large number gives you a small number.
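The mechanics of the two paragraphs above can be sketched in a few lines of Python; the score lists are invented purely to illustrate a narrow and a wide spread:

```python
import statistics

def effect_size(gain, scores):
    """Raw gain in marks divided by the standard deviation of the cohort."""
    return gain / statistics.stdev(scores)

year1_scores = [48, 50, 50, 51, 52, 49, 50]    # tightly bunched cohort
year11_scores = [20, 35, 50, 65, 80, 30, 70]   # widely spread cohort

gain = 5  # the same raw progress in both cohorts
print(effect_size(gain, year1_scores))   # large: divided by a small SD
print(effect_size(gain, year11_scores))  # small: divided by a large SD
```

Identical progress in marks, yet the Year 1 figure comes out more than ten times larger, simply because the denominator is smaller.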

Hat Tip to @dylanwiliam

## 28 thoughts on “The Age effect which means the ‘Effect Size’ is useless”

1. Jan Tishauser

You seem to find Hattie’s work of great importance, considering your investment of time and energy in it. What do you think about his main findings? For instance, that system changes, like class size or funding, have a far smaller effect than the pedagogic behaviour of a teacher? Or his findings about the effectiveness of direct instruction and mastery learning in comparison with discovery learning? Are these findings similar to your views on education?

By the way, a more detailed explanation on how to calculate effect sizes is provided in the appendix of “Visible Learning for Teachers”.

You can also use the following source for the calculation of effect sizes:

http://bit.ly/1hWchx7

This guide was made by Prof. Rob Coe of Durham University. You will notice that he uses the SD of one class instead of the pooled SD, but he too uses the same SD for both classes.

• I am quite aware of how to calculate Effect Sizes, having read Jacob Cohen’s ‘Statistical Power Analysis for the Behavioral Sciences’ (1969), the book where the Effect Size was first used, and ‘Statistical Methods for Meta-Analysis’ (1985) by Larry Hedges and Ingram Olkin, where the Effect Size was developed.

There are several ways to calculate an Effect Size: you can use Cohen’s d, Glass’s Δ or Hedges’ g. None of these are used by the Mathematical community.
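For what it’s worth, the three variants differ only in which standard deviation goes into the denominator. A minimal sketch, with invented sample data:

```python
import statistics

def cohens_d(treat, control):
    """Cohen's d: difference in means over the pooled SD of both groups."""
    n1, n2 = len(treat), len(control)
    v1, v2 = statistics.variance(treat), statistics.variance(control)
    pooled_sd = (((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)) ** 0.5
    return (statistics.mean(treat) - statistics.mean(control)) / pooled_sd

def glass_delta(treat, control):
    """Glass's Delta: difference in means over the control group's SD only."""
    return (statistics.mean(treat) - statistics.mean(control)) / statistics.stdev(control)

def hedges_g(treat, control):
    """Hedges' g: Cohen's d with a small-sample bias correction."""
    n = len(treat) + len(control)
    return cohens_d(treat, control) * (1 - 3 / (4 * n - 9))

treat, control = [55, 60, 65, 70], [50, 52, 54, 56]
print(cohens_d(treat, control), glass_delta(treat, control), hedges_g(treat, control))
```

On the same data the three formulas give noticeably different numbers, which is part of the difficulty of comparing Effect Sizes across studies.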

2. Luc Kumps

By the way, in “Visible Learning for Teachers” Hattie refers to the analysis of Hill et al. He concludes: “Please note that I did not say that we use this hinge-point for MAKING DECISIONS, but rather we use it to START DISCUSSIONS about the effect of teachers on students” (p. 17)

• Does he mention that there is a consistent age pattern in all Effect Size data that he completely missed despite studying it for 10 years?

• Luc Kumps

“Hill, Bloom, Black and Lipsey (2008) analysed the norms for 13 major standardized achievement tests (in USA), and found an average growth in maths and reading of about 0.40 – and, like in the NZ sample, the effects for each year were greater in the younger and lower in the older grades. So while d = 0.40 is a worthwhile average, we may need to expect more from the younger grades (d > 0.60) than for the older grades (d>0.30). I choose this average (0.40) as the benchmark for assessing the influence that teachers have on achievement. In my work with schools since the publication of Visible Learning, we have used this hinge-point as the basis for discussions” (Hattie, 2012, p. 17)

• Luc Kumps

This is the research by Hill et al. (2008) that Hattie refers to in Visible Learning for Teachers (2012): http://www.ncaase.com/docs/HillBloomBlackLipsey2007.pdf
Excluding the outlier of Grade 12, the averages are 0.48 SD annual growth for Reading and 0.52 for Math. Including Grade 12: 0.45 and 0.48. Kindergarten, Grade 1 and Grade 2 show twice or more that growth rate.

Two years ago, similar research was presented (Lee, Finn & Liu, 2012), yielding similar results.

Hattie, J. (2012). Visible learning for teachers : maximizing impact on learning. London; New York: Routledge.

Lee, J., Finn, J. & Liu, X. (2012). Time-indexed Effect Size for P-12 Reading and Math Program Evaluation. Paper presented at the Educational Effectiveness (SREE) spring 2012 conference, Washington, DC. Retrieved from http://gse.buffalo.edu/faculty/centers/ties

3. Just look at the pattern. How can you say ‘0.40 is a worthwhile average’?

4. Dylan Wiliam

A more important problem is that we need to be clear about the time interval for which the effect size is being quoted. For example, in the King’s-Medway-Oxfordshire Formative Assessment Project, we quoted an effect size of 0.32, which was measured against other students in the same school over a whole year. So if the control students progressed by 0.4, then the experimental students progressed by 0.72. The effect size we published was 0.32 because it is in addition to “business as usual”, but some people don’t appreciate that it comes on top of the normal expected growth. Moreover, those who slavishly follow Cohen’s 1988 book, using what Russell Lenth describes as “tee-shirt effect sizes” (small, medium, large), would describe an effect size of 0.32 as “small”, but the effect is rather substantial, being equivalent to an 80% increase in the rate of learning.
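The arithmetic in that comment, made explicit:

```python
# Worked version of the King's-Medway-Oxfordshire figures quoted above.
business_as_usual = 0.40   # typical annual growth of the control students
extra = 0.32               # the published effect size, over and above that

total_growth = business_as_usual + extra       # experimental group: ~0.72
relative_increase = extra / business_as_usual  # ~0.80, i.e. 80% faster learning

print(total_growth, relative_increase)
```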

5. Wayne

Thanks for this article. I agree that 0.40 being used in classrooms is misleading at best; however, effect sizes are still extremely useful if you compare them to a mean effect size: for example, use the scale score from the 50th percentile and measure growth compared with the mean effect size.

6. Effect sizes are certainly not ‘useless’, but recent research shows they need to be interpreted with much more caution than they have been in the past.

They are not ‘useless’, as they are the only real means we have of comparing the different factors that might affect achievement, so they can, if used with caution, help teachers sift the factors that are worth experimenting with from those that are not.
If you abandon effect sizes, you are left with professional judgement, and that is much less reliable than effect size.
Olie needs to explain how he would recommend we sift the best from the worst teaching strategies.
He also needs to explain why, when teachers are trained in the use of high effect size methods, and work in communities of practice to perfect their use of them, they get a huge improvement in students’ learning: Dylan Wiliam’s work, for example, described in the comment above.

• They are not the only way of comparing different strategies. There is a whole world of Statistics out there: proper Statistics, invented by proper Mathematicians, not made up by Psychologists with no Maths training.
You need to do experiments. At its most basic: one class uses one method of teaching, another class uses a different method. Compare their progress at the end.
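A minimal sketch of that basic experiment, using a standard two-sample comparison (a Welch t statistic) on invented gain scores; a real analysis would go on to a p-value and confidence interval:

```python
import math
import statistics

def welch_t(a, b):
    """Welch's t statistic for two independent samples of gain scores."""
    mean_diff = statistics.mean(a) - statistics.mean(b)
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return mean_diff / se

method_a_gains = [8, 12, 10, 9, 11, 13, 7, 10]  # hypothetical class A
method_b_gains = [5, 7, 6, 9, 4, 8, 6, 7]       # hypothetical class B

print(f"t = {welch_t(method_a_gains, method_b_gains):.2f}")
```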

• Yes, but you will use different tests in each of these experiments, so you need a common metric (unit) to compare the many thousands of experiments on different teaching strategies that have been done. Otherwise you can’t compare.
You can’t compare, say, the percentages that students get in the tests, as the tests are all of different difficulty. So the metric used is the effect size; this is very widely accepted in the world of educational research, and I don’t know of any other method that allows comparison.

• I understand what they think they’re doing; I’m just telling you, as someone with a Maths degree, that they’re not doing it. Have you ever considered that it simply might not be possible to properly compare tests with different metrics?

• I’m aware that it is not easy to compare tests, and that hundreds of thousands of researchers use effect size as their preferred way of overcoming this problem. I imagine that if there were a better way they would use it, and if it were impossible they would not try to do it.

• Yes, thousands of Education Researchers; not Scientists, not Mathematicians. You only need one person to get it wrong at the start and then everyone else copies them unthinkingly. This is what has happened.

• The idea of a meta-study was developed by Gene Glass, who studied maths and scientific investigatory procedures at university to PhD level. Since then, similarly qualified, mathematically and scientifically competent people have kept meta-study production under constant review, including the use of effect sizes and the processes involved in creating meta-studies.
You are right that there are problems; effect sizes and meta-studies are not perfect by any means. But along with other evidence, such as cognitive psychology and studies of how excellent teachers work, they help teachers choose which changes to their practice might be productive and which might not. This is vitally helpful. How else are teachers to choose between rival ideas about how to improve things for students? There is no other way.
However, the final evidence is the teacher’s own evidence, from their own practice, as to whether the methods or changes they have experimented with are working for them and their students.

7. George LILLEY

This is a pretty big claim Geoff- “If you abandon effect sizes, you are left with professional judgement, and that is much less reliable than effect size.”

Have you got any evidence for this?

• A teacher wishing to improve needs to decide what methods, strategies, techniques etc are worth experimenting with. They can use effect sizes as a rough filter, and experiment with methods that have a large effect size in very many experiments, arguing that if the methods work well in other classrooms, they MAY work well in mine.
In the CONTROL group of the experiments that give rise to a high effect size, the teacher is using their own professional judgement. In the EXPERIMENTAL group, the same teacher is using a prescribed method they don’t usually use. So a high effect size identifies a strategy that beats the professional judgement of the teachers in the experiments. Teachers in experiments are almost always experienced and competent, though not necessarily exceptionally capable; they are perhaps a bit better than average, certainly not worse.
So we have hundreds of thousands of experiments showing that the average teacher’s professional judgement can be beaten by some clearly identified strategies.

• George LILLEY

I guess you are talking about Hattie’s evidence. Well, apart from all the calculation and interpretation issues mentioned in this blog, how would you measure Hattie’s highest influence, self-report grades? Hattie gets an astounding result, d = 1.44, and interprets it to mean an acceleration in achievement of 3+ years.

So my judgement, and that of many other teachers, is that this is not possible.

For a start, how would you set up an experiment to measure an influence like self-report and separate it from other teaching strategies? Using your methodology, how would you deliver the strategy of self-report? Then you need to measure student performance, say, in six weeks’ time. Then you need to set up a control group: how would that run?

Here is a bit more of a detailed analysis of Hattie’s evidence for self-report – http://visablelearning.blogspot.com/p/self-report-grades.html

• You’re right that ‘self-reported grades’ can’t lead directly to an intervention, but nearly all the other ‘influences’ on Hattie’s list can. The research is all carried out by others; Hattie has very helpfully summarised the summaries and reviews of it. I would call ‘self-reported grades’ a correlation with achievement rather than a probable cause of achievement: it shows that students can guess their grades remarkably accurately, though most of the evidence comes from Higher Education.

However, most items on Hattie’s list can be tried directly, as can the high effect size methods recommended by Marzano. For example, what I call ‘medal and mission’ feedback, goal setting in advance, or class discussion: these, and many others, are easily tried in the classroom. The teacher does not need to do their own control group/experimental group study; these have been done already. They just need to understand why the methods work, and then try them out in an action-research sort of way, as I describe here.

Incidentally this is not the way Hattie suggests, which he sets out in Visible Learning.

My main point is that summaries of research, collected by Hattie and Marzano, point to methods that have worked exceptionally well in repeated, rigorous classroom trials, and that the effect sizes from these trials, while imperfect in many ways, are still a useful gauge of what might work well in your own classroom. It seems crazy not to make some use of effect sizes in your assessment of what to try and what not to.

• George LILLEY

You are doing some good stuff, Geoff, similar to the things that I’ve seen work: teachers get together and develop lessons jointly, and one person teaches the lesson while the others observe and suggest improvements (but this is expensive to do). The things that have been developed along those lines for maths teachers here in Melbourne are maths300 and maths with attitude. We have not used Hattie, although Marzano is a little more useful.

What’s your take on Hattie finding that homework has a small effect, while Marzano has it as one of his key strategies? You emphasize extended practice; I took that to mean you are pro homework.

Re self-report being the only influence that is difficult to measure, I disagree. For example, how do you separate these influences from each other: feedback, class management, group work, expectation, motivation, personality, problem solving, creativity, peer influence, welfare, diet, etc.? That’s why Hattie mostly uses correlation studies, not true experiments like the ones you are doing.

After reading a lot of Hattie’s work (quite apart from his research being so poor), I don’t think it is worthwhile to separate influences anyway. Teaching is complicated because it involves a lot of variables. It is not a precise science; what works in one class does not work in another, and the teacher needs to adjust and be flexible (at least Marzano agrees with me on this point).

The only thing Hattie seems to be used for here is to justify decreased spending on education and to distract us from the more important issues of social inequity in Education, and how to deal with the complexity of students.

• The summaries of research on teacher improvement don’t suggest observation, but instead what I call ‘Supported Experiments’ http://geoffpetty.com/for-team-leaders/supported-experiments/

Helen Timperley has produced a very useful summary of research on how to improve teaching: EdPractices_18.pdf.

When I last read the summaries of research on homework, they said it didn’t work in primary but did in secondary and beyond, and that it worked best when there was feedback to the learner on how well they had done it. Hattie makes the same point in Visible Learning. Done right, homework can be effective, but I worry that students don’t have enough time to explore life if they have too much of it.

I agree that feedback, class management and the rest all affect each other. However, that doesn’t mean you can’t look at them separately in a useful way. Most writing on education treats issues like feedback or class management separately, and this can be helpful.

I don’t do research myself; I look at summaries of the best research done by others, as this is the most authoritative source of advice.
http://geoffpetty.com/the-uses-and-abuses-of-evidence-in-education/
I agree that just because something works in a control-group study, that doesn’t mean it must work in your own classroom. That is one of the reasons I like Supported Experiments.