In addition, tests have become the curriculum in many respects. But tests can only sample knowledge, and test results are invalid and unreliable when there is teaching to the test. The problems with high stakes testing are well known and need to be shouted from the rooftops.
So whilst we need to filter children as they get older, we award grades rather than raw scores for a reason. Subjecting scores from individual tests – especially tests taken by young children with no stake in their outcome – to complex statistical analysis shows a basic misunderstanding of the problems and complexities inherent in measuring educational attainment, problems and complexities which are clearly understood by many in the educational establishment.
I have written this article to outline the reasons why any use of test scores to judge schools is ridiculous, and so that I can refer to it in future examinations of Ofsted's misguided use of test scores in spurious judgements which continue to be made on English education.
Are you sitting comfortably? Then I’ll begin…
There are inherent problems with measuring anything, never mind something abstract such as knowledge
Better minds than mine have wrestled with the problems of what can be measured, and what any measurement actually measures. Albert Einstein and Niels Bohr, neither of whom were daft, tussled with the idea, as discussed in this excellent blog.
To measure something you need to use consistent intervals, and a unit of measurement. On a ruler, a centimetre is a defined size, and the interval between centimetres is the same size. What is the unit of Education? What is the consistent interval between the units? Chapter 14 of Noel Wilson’s fascinating Educational Standards and the Problem of Error from 1998 explores this thorny issue at length.
But that's not all. There are other problems with measuring knowledge, too. Where is zero? Without zero, how can you compare two measurements in percentage terms? The older a child, the more the child knows, so how do you account for age? How can you use numbers to measure something as abstract as what a person knows?
Any measurement is inherently uncertain, and this is ignored in education
Most other sciences look at Physics with something akin to envy, because Physics is about actual stuff, much of which you can see. And the stuff you can’t see, like fluid dynamics or electromagnetism, you can model with a high degree of accuracy. But even in Physics, the principle of uncertainty in measurement is a core concept. Whatever you measure, you are certainly wrong. There are limits to your wrongness, of course, but you are certainly not right.
So any machinery is built to acceptable tolerances, and the things built on physics which we take for granted – toasters, cars, bicycles, computers – work well because someone somewhere has studied and understood the principle of measurement error. A measurement of something which is inherently unstable should always be reported with an estimate of its observational error.
In education, we don’t deal with stuff you can see, or even stuff which you can model with a high degree of accuracy. We deal with people, and little people at that. Our observational errors are huge. Any measurement we make is inherently uncertain.
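To make the point concrete, here is a minimal simulation (every number in it is invented for illustration) of what observational error looks like for a test score. One child, with a 'true' attainment of 60 out of 100, sits the same style of test twenty times; each sitting adds noise from question selection, mood, marking and luck:

```python
import random

random.seed(42)

# Illustrative sketch, all numbers invented: a child with a "true"
# attainment of 60/100 sits twenty comparable tests. Each sitting
# adds noise - question choice, mood, marking, luck.
true_score = 60
sittings = [true_score + random.gauss(0, 5) for _ in range(20)]

mean = sum(sittings) / len(sittings)
spread = (sum((s - mean) ** 2 for s in sittings) / (len(sittings) - 1)) ** 0.5

print(f"observed scores range from {min(sittings):.0f} to {max(sittings):.0f}")
print(f"mean {mean:.1f}, standard deviation {spread:.1f}")
# Any single sitting is one draw from this spread - yet children and
# schools are judged on exactly one draw.
```

The single score a child actually receives is just one draw from a distribution like this, which is why reporting it without any estimate of error is so misleading.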
Not that this is understood in education. Tests are taken, their results are taken as definitive, and they are used to grade children and judge schools. This is ridiculous in itself, but there are further problems with using test scores in the way in which they are currently used in English education.
In Years 2, 6, 11 and 13, end of year tests have become the curriculum
Since the early 1990s, testing in England has become a high stakes game of school accountability. End of Key Stage tests have come to drive the curriculum in English and Maths (or Literacy and Numeracy, as it was until around five minutes ago). By this, I mean the curriculum as it is defined by schools, not the National Curriculum. I mean ‘what is actually taught in school’.
In schools, teachers are very aware of what their children will be tested on at the end of the year. Year 2 and Year 6 – the years with externally marked tests – are driven almost exclusively by schools’ understanding and anticipation of what is on the end of year tests. I understand from those teaching GCSE and A Level that this is the case in these (externally marked) years as well.
In my experience of Primary school, the end of year tests have become important in Years 3, 4 and 5 too, and the curriculum in English and Mathematics has been narrowed as a result, with other - untested - subjects being marginalised and diminished. Even Ofsted knows this is a problem.
The curriculum has come to be defined by what is being asked of children in the tests they take.
Tests are samples which provide estimates of knowledge, skills and understanding
Until I finally made time to properly research the best that has been written and said about measuring educational attainment, I suspect that I had a typical teacher’s view of the tests which children take. I thought that they were a fairly poor method of assessing what a child had learned, and were clearly biased toward those with, in the jargon, a lot of ‘capital’, both social and cultural.
As a statistician, I was aware that the numerical result of a test was simply one of a range of possible results which a child might be awarded, and that there are a number of factors which affect the result of any test. But I had not really considered what a test was measuring and why.
Having read a lot of educational research written by those who design tests, I have found out a lot which simply had not occurred to me. I suspect quite a bit hasn’t occurred to many teachers and policy makers, either. For example, it hadn’t really occurred to me that tests are designed to sample knowledge, skills and understanding to provide an estimate of the full range of knowledge, skills and understanding (the ‘domain’) a child might possibly have.
The following, from Daniel Koretz’s Measuring Up, published in 2008, summarises this point:
‘The results of an achievement test – the behaviour of students in answering a small sample of questions – is used to estimate how students would perform across the entire domain if we were able to measure it directly.’ (p20)
Creating these samples is incredibly difficult. It simply isn’t possible to test everything which you might want to test. Even if the test is very good at assessing achievement, there may be other aspects of school quality which a test simply can’t tell you, such as whether the children have enjoyed the experience of learning the subject, or whether they have come to hate it with a passion.
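Koretz's point about sampling can be sketched in a few lines of code (all figures invented for illustration). Treat the domain as 500 items a child might know, of which this child knows about 70%; a 25-question test samples that domain and produces an estimate:

```python
import random

random.seed(1)

# Illustrative sketch, numbers invented: the "domain" is everything the
# child might be expected to know - say 500 items - of which this child
# knows roughly 70%.
domain = [random.random() < 0.70 for _ in range(500)]

def sit_test(n_questions=25):
    """A test can only ask about a small sample of the domain."""
    asked = random.sample(domain, n_questions)
    return 100 * sum(asked) / len(asked)

# Ten equally valid tests give ten different estimates of the same child.
estimates = [sit_test() for _ in range(10)]
print([f"{e:.0f}%" for e in estimates])
```

Each of those ten tests is a perfectly legitimate sample of the same domain, and each returns a different score; none of them is 'the' measure of what the child knows.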
A further problem with using tests to judge schools is that this in itself can cause the sampling to be skewed.
If the samples are skewed, the estimates are meaningless
As soon as you understand that tests are simply samples which provide estimates, the problems of high stakes testing become more clear. As Daniel Koretz notes, ‘A failure to grasp this principle is at the root of widespread misunderstandings of test scores.’
There are many things which flow from this principle. One is that teaching to the test distorts what the test is designed to do. If a child has learnt a method to gain marks on the test rather than further their understanding of the subject, the test loses a great deal of its validity. This is also true if an area which never appears on the test is not taught. It can be argued that any teaching which focuses on the test rather than the subject distorts the result.
I could go on, but many, many people have spent a great deal of time describing exactly how much teaching to the test goes on in English schools as a result of the accountability system which has been developed. Warwick Mansell’s Education by Numbers, published in 2008, covered this in exhaustive detail. In his chapter on Primary education, he found that the government’s own figures suggested that test preparation took up 150 hours of lesson time in the four months of Year 6 leading up to the Key Stage 2 SATs, a staggering 44% of available teaching time.
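A quick arithmetic check (my own working, not Mansell's) shows those two figures are consistent with each other:

```python
# Sanity check of the figures quoted from Mansell: 150 hours of test
# preparation said to be 44% of available teaching time in the roughly
# four months (about 16 school weeks) before the Key Stage 2 SATs.
prep_hours = 150
share = 0.44

available = prep_hours / share          # implied total teaching time
per_week = available / 16               # assuming ~16 teaching weeks

print(f"implied teaching time: {available:.0f} hours")
print(f"roughly {per_week:.0f} hours per week")
# About 341 hours, around 21 hours a week - plausible for a primary
# timetable, so the 44% figure hangs together.
```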
There is a huge dilemma at the heart of testing: Tests estimate domain knowledge using sampling. Test results can be improved by teaching to the test rather than the domain. If teachers and schools are judged by test results, they will teach to the test rather than the domain. If the samples are skewed, the estimates provided by the test become meaningless.
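That dilemma can be simulated directly (all numbers invented for illustration). Suppose the test predictably samples only 100 items out of a 500-item domain. A school teaching the whole domain is then out-scored by a school teaching only the tested slice:

```python
import random

random.seed(7)

# Illustrative sketch, numbers invented. The domain has 500 items, but
# the test predictably draws its questions from the first 100 only.
DOMAIN_SIZE, TESTED_SLICE = 500, 100

def score(known, n_questions=25):
    """Score a child on a 25-question test drawn from the tested slice."""
    asked = random.sample(range(TESTED_SLICE), n_questions)
    return 100 * sum(i in known for i in asked) / n_questions

# School A teaches the whole domain; the child knows 60% of it, spread evenly.
whole = set(random.sample(range(DOMAIN_SIZE), 300))
# School B teaches to the test; the child knows 90 of the 100 tested items.
narrow = set(random.sample(range(TESTED_SLICE), 90))

print(f"School A: knows {len(whole)}/500 of domain, scores ~{score(whole):.0f}%")
print(f"School B: knows {len(narrow)}/500 of domain, scores ~{score(narrow):.0f}%")
# The test ranks the far narrower education higher: once the sample is
# skewed, the estimate of domain knowledge is meaningless.
```

The child at School B knows less than a third as much of the domain, yet scores much higher, because the sample the test draws is no longer representative of the domain it claims to estimate.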
There are other factors which affect test scores
I’ve detailed many of the other things which affect test scores, not least a child’s age within their cohort, and the wider factors affecting children’s educational experience. Added to this is the fact that an enormous number of test results are simply marked incorrectly, and that between 24% and 40% of children have had coaching or tuition by someone outside of school.
All of this is on top of the ridiculous, flawed assumption that a given test score is somehow a definitive measure which can be used as if it were actually a number on a linear scale, rather than one of a wide range of possible outcomes on an undefined scale.
Tests should be about children, not their teachers or schools
Finding out what children know is a skill in itself. Some tests can help a teacher to work out what a child knows. External tests and examinations don't do this - they sample knowledge to estimate wider knowledge within a domain. Using tests and examinations to hold teachers and schools to account is ridiculous. There are other, better ways in which schools can and should be held to account.
Daniel Koretz sums up the reasons why tests should not be used to judge schools succinctly.
“There are three distinct reasons why scores on one test, taken by themselves, are not enough to tell which schools are good or bad. The first is that even a very good achievement test is necessarily incomplete and will leave many aspects of school quality unmeasured.
The second reason not to assume that higher scores necessarily identify better schools is that, in the current climate, there can be very large differences among schools in the amount of score inflation.
The third and perhaps most important reason scores cannot tell you whether a school is bad or good is that schools are not the only influence on test scores. Other factors, such as the educational attainment and educational goals of parents, have a great impact on students’ performance.” (p325-6)
In the last thirty years, the lunatics have taken over the asylum
A whole world of weird has been developed in education which assumes that, despite the obvious problems with high-stakes testing, single scores can be attached to a child, for which schools are held solely responsible.
It follows that labelling a child 27, or 21, or 35, and then using that number to judge schools, is ridiculous. To use this imprecisely measured value as a definitive measure of ability, or to judge 'progress', is nonsensical. Using these measured values in further analysis is grand folly, and Not Even Wrong. Ofsted's use of test results to judge schools is ridiculous, and Ofsted should explore ways to get themselves out of this mess.
As I've noted already, the good news is that Ofsted is listening. We're moving on from the era in which Ofsted directed teaching in schools which were clearly doing a good job. Now we need to get Ofsted to move on from the use of test results to judge schools, and to make decisions based on the broader outcomes of schooling, which can't be measured using numbers.
Noel Wilson’s Educational Standards and the Problem of Error
Warwick Mansell’s Education by Numbers
Daniel Koretz’s Measuring Up
For those who are interested in further reading I recommend Daniel Koretz's and Warwick Mansell's books mentioned above. There is a large body of evidence that written assessment is a huge and expanding minefield, much of which can be found online. As a starter for ten, here are papers by Dylan Wiliam, Robert Coe, and Tim Oates.