Grades are Guesswork

28/7/2014

Externally assessed grades at both Key Stage 2 and GCSE are effectively guesses. The vagaries of designing and marking written tests of a pupil’s ability mean that the guesses are often wrong, and simply offer a snapshot of a pupil’s actual knowledge, skills and understanding of a given subject rather than their 'true score'. On a different day, with a different test, a pupil is highly likely to get a completely different mark and quite likely to be awarded a different grade entirely, making the grades pupils are awarded guesswork at best.*

Written tests are not very reliable

As Lord Bew noted in his report on KS2 testing, “It is generally accepted that any test or examination, however well constructed, will always include a degree of measurement error. We understand that, as with all tests where pupils are categorised, the level thresholds in Key Stage 2 tests mean that one mark can make the difference between one level and the next. That mark could be lost or gained through a pupil mis-reading an instruction in the test or making a fortunate choice in a multiple-choice question, or through slight variations in marking practice. These differences will be highly significant for the individual pupil.” (p55)

Lord Bew noted that Dylan Wiliam had suggested that 32% of pupils could be given the wrong National Curriculum level. Wiliam noted that “we must be aware that the results of even the best tests can be wildly inaccurate for individual students, and that high-stakes decisions should never be based on the results of individual tests” (p3). Much of the edifice of data-driven education is built on precisely this type of fundamental flaw, which does not recognise that grades are guesswork.
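To get a feel for how measurement error turns into wrong levels, here is a rough simulation sketch. It is not Wiliam's method and does not reproduce his 32% figure. The thresholds are illustrative (I have assumed they are the minimum marks for Levels 3, 4 and 5 on a 50-mark paper), and the uniform spread of 'true scores' and the normally distributed error are both assumptions made purely for the sake of the example.

```python
import random

# ASSUMPTION: illustrative thresholds, read as the minimum marks needed
# for Levels 3, 4 and 5 on a 50-mark paper.
THRESHOLDS = [15, 24, 39]


def level_for(mark, thresholds=THRESHOLDS):
    """Return the level (2 to 5) awarded for a mark."""
    # Each threshold reached moves the pupil up one level from Level 2.
    return 2 + sum(mark >= t for t in thresholds)


def misclassification_rate(error_sd, n_pupils=10_000, max_mark=50, seed=1):
    """Estimate the share of simulated pupils whose observed level differs
    from the level their 'true score' would give.

    ASSUMPTIONS (hypothetical, for illustration only):
    - true scores are spread uniformly between 0 and max_mark;
    - measurement error is normal with standard deviation error_sd.
    """
    rng = random.Random(seed)
    wrong = 0
    for _ in range(n_pupils):
        true_score = rng.uniform(0, max_mark)
        observed = true_score + rng.gauss(0, error_sd)
        # A paper cannot award fewer than 0 or more than max_mark marks.
        observed = min(max(observed, 0), max_mark)
        if level_for(observed) != level_for(true_score):
            wrong += 1
    return wrong / n_pupils


if __name__ == "__main__":
    for sd in (0, 2, 4, 6):
        print(f"error SD {sd}: {misclassification_rate(sd):.1%} given a different level")
```

With no error at all, nobody is misclassified. As the assumed error grows, so does the share of pupils who land in the wrong level, even though nothing about what they know has changed.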

Grade boundaries are very narrow

The difference between one grade and another is often simply too narrow, and children are miscategorised as a result. This has significant implications for the current system of high-stakes accountability by which teachers and schools are judged. The 'data' used to assess 'performance' is, quite simply, not up to the task.

For example, at Key Stage 2, there are externally marked written assessments of Numeracy, Reading, and Spelling, Grammar and Punctuation (SPAG). I have looked closely at Numeracy, although it would be possible to look at Reading and SPAG in the same way. There are three Numeracy papers, with 100 marks available in total. The grade thresholds vary from year to year. In 2014, they were as follows:

The middle level, Level 4, has a 32-mark spread, so any child is at most 16 marks away from either Level 3 or Level 5. The key observation is that a significant number of children will be within a handful of marks of a higher or lower level. Those with either 40+ or 70+ marks could simply have misread a question, missed an entire page (this happens with surprising regularity), managed to transcribe the wrong answer and so on. Some children will, of course, be lucky, and their mistakes and anomalies may even themselves out. For some children, however, they will not.

The reading test has a total of 50 marks, with grade boundaries at 15, 24 and 39 marks. Half of all Level 4 results are within 7.5 marks of the adjacent grade. The SPAG test has 70 marks, with grade boundaries at 32, 49 and 61 marks. 50% of Level 4 results are within just 6 marks of the adjacent grade.
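The 7.5 and 6 mark figures correspond to half the width of the Level 4 band. A minimal sketch of the arithmetic, assuming the second and third boundaries quoted are the lowest marks for Level 4 and Level 5 respectively (the function name is my own, for illustration):

```python
def band_width_and_max_distance(lower_boundary, upper_boundary):
    """Return the width of a grade band and the furthest any mark in that
    band can be from its nearest boundary (half the width).

    ASSUMPTION: lower_boundary is the lowest mark in the band and
    upper_boundary is the lowest mark in the band above.
    """
    width = upper_boundary - lower_boundary
    return width, width / 2


print(band_width_and_max_distance(24, 39))  # reading Level 4: (15, 7.5)
print(band_width_and_max_distance(49, 61))  # SPAG Level 4: (12, 6.0)
```

The same calculation on the 32-mark Numeracy band gives the 16 marks mentioned earlier.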

For an individual child, their result is reported as a single grade (this is what the child takes home at the end of year 6). The raw result is also recorded as a single number (a fine grade) which is used by data-crunching analyses such as that used by RAISEonline and the FFT. There is no confidence interval, or range of potential outcomes. The result is a simple snapshot which takes no account of any potential noise in the result.
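For comparison, this is roughly what reporting a range might look like. The sketch uses the standard error of measurement from classical test theory (the spread of marks multiplied by the square root of one minus the test's reliability). The spread of 15 marks and the reliability of 0.9 are hypothetical values chosen only to illustrate the calculation; they are not figures for any real KS2 or GCSE paper.

```python
import math


def standard_error_of_measurement(score_sd, reliability):
    """Classical test theory: SEM = SD * sqrt(1 - reliability)."""
    return score_sd * math.sqrt(1 - reliability)


def approximate_95_interval(observed_mark, score_sd, reliability):
    """Return an approximate 95% range around an observed mark,
    assuming normally distributed measurement error."""
    margin = 1.96 * standard_error_of_measurement(score_sd, reliability)
    return observed_mark - margin, observed_mark + margin


# HYPOTHETICAL inputs: a pupil scoring 60, marks with a standard deviation
# of 15, and a test reliability of 0.9.
low, high = approximate_95_interval(60, score_sd=15, reliability=0.9)
print(f"Observed mark 60, plausible range roughly {low:.1f} to {high:.1f}")
```

Under these assumed values the range spans the best part of twenty marks, which is wider than several of the grade bands discussed here.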

Whilst, theoretically at least, levels have been consigned to history, children were still awarded levels at the end of Year 6 in 2014, and the current thinking on future assessments (see commentary by Warwick Mansell here) suggests that similar flawed 'data' will continue to be used to assess pupils, teachers and schools.

At GCSE, with many more subjects, things are a little harder to summarise. It is possible, though, to look at some grade boundaries to get an idea of how narrow the gaps between adjacent grades are.

An example of a set of grade boundaries can be found on the AQA website. The grade boundaries for English language on page 7 show the following:

This shows around half of students awarded a C grade are within 3.5 marks of an adjacent grade, within 14 marks of an A and 19 marks of an E. A child awarded a C with 60 marks is within 8 marks of an A.

Some grade boundaries are incredibly narrow. In the 2011 PE01 exam, the top grades were separated by a mere 2 marks:

Essentially, the chance of being awarded a particular grade comes down to luck.

Castles built on sand

The grades an individual pupil receives are simply guesswork. We should not have allowed anyone to build an accountability infrastructure based on this 'data'.

*The main exception to this guesswork is for children working at the very highest and very lowest levels of ability. If a pupil knows everything there is to know about a subject at KS2 or GCSE, their grade is practically guaranteed, barring some catastrophic performance on the day of the exam, since scores beyond 100% are not possible. Likewise, a child who knows none of the required information to answer the questions in a paper will achieve zero on the test and cannot achieve less.

3 Comments

Chemistrypoet

28/7/2014 02:25:09 pm

I can see that there would always be a concern at the boundaries of levels, and that the concern is bigger the narrower is the gap between the levels. The most obvious way of dealing with this is to make the gap between levels much bigger (by asking more questions, rather than giving more marks per question) so that the percentage of students clustered close to a boundary is much reduced.

Second, what is the variation seen when the same students take the same paper many times in quick succession (without any feedback between)? What is the variation seen when the same student takes different papers designed to do the same assessment in quick succession? Is there any research available on this? In attempting to determine how unreliable the results of such testing are, it would help to have some data on this variation, I think.

A value of 32% is quoted in the blog for the percentage of students likely to be given the wrong grade... On what basis was this value calculated, and how representative is it likely to be? You say that these grades are guesswork... but how good or bad a guess are they? (Guesswork is really just another name for uncertainty, here, I think.)

[by the way, you can see why relative ranking tests are preferred]

Jack Marwood

28/7/2014 04:11:54 pm

Thanks for reading and commenting, CP. It is always good to get feedback on my thoughts.

To take your points in turn, firstly, the question of whether we could ask more questions. This is, of course, a good idea - except that we are asking children to complete these questions, and there is clearly a limit to the length of a written test beyond which it begins to test endurance rather than knowledge...

Secondly, the questions about variation and reliability are best answered by the paper by Dylan Wiliam which I referenced, which is well worth reading.

My main reason for writing this MiniBlog is to make the point that written tests are extremely unreliable measures of knowledge - I use guesswork for alliterative reasons, and because it indicates a healthy scepticism regarding their validity. 'Uncertainty' is a little ambiguous for my liking and implies doubt rather than distrust.

It should be clear that I suggest that tests are unreliable, and therefore that the results of tests should not be used as 'data' in the way in which they currently are. Judging teachers and schools (amongst others) by what amounts to guesswork is ridiculous, and I hope my post helps to change thinking about results as 'data'.

joiningthedebate

14/5/2015 03:54:59 pm

Great point about fine grades and their seeming accuracy. I would like to express two ironies (if that's the correct word) that I have been aware of since starting secondary teaching in the early 90s.
At some point we (as a profession) were told not to mark work out of 10. If you did, you were seen as too traditional and not moving with the times, and yet the National Curriculum levels had come along and slapped a NUMBER on the child. Originally I think there were in fact 10 levels but this went down to 8. How hypocritical!
Another irony is that it is also frowned upon to rank students in your class. Remember the 'good old days' when reports actually contained the position in class written in biro by your teacher? So positions are frowned upon, and yet students have a level (then sublevel or even decimal level) slapped upon them which ranks them in the yr group. Added to this, league tables rank students and whole schools. I find this also hypocritical.
At a recent parents' evening I took a risk. It was a top set yr 7. I wouldn't do this with all classes but I thought I would discuss with parents and student where they were ranked in the class according to my personal scores based on certain classwork which is done individually. They all understood it was not an exact science by any means but appreciated coming away with a clear idea about how they were doing. Much better than that woolly conversation we have about being on track to achieve your personal target. Please don't judge me as to whether you think I am a dinosaur. My main point is about the irony/hypocrisy present.
