Written tests are not very reliable
As Lord Bew noted in his report on KS2 testing, “It is generally accepted that any test or examination, however well constructed, will always include a degree of measurement error. We understand that, as with all tests where pupils are categorised, the level thresholds in Key Stage 2 tests mean that one mark can make the difference between one level and the next. That mark could be lost or gained through a pupil mis-reading an instruction in the test or making a fortunate choice in a multiple-choice question, or through slight variations in marking practice. These differences will be highly significant for the individual pupil.” (p55)
Lord Bew noted that Dylan Wiliam had suggested that 32% of pupils could be given the wrong National Curriculum level. Wiliam noted that, “we must be aware that the results of even the best tests can be wildly inaccurate for individual students, and that high-stakes decisions should never be based on the results of individual tests” (p3). Much of the edifice of data-driven education is built on precisely this fundamental flaw: a failure to recognise that grades are guesswork.
Grade boundaries are very narrow
The difference between one grade and another is often simply too narrow, and children are miscategorised as a result. This has significant implications for the current system of high-stakes accountability by which teachers and schools are judged. The 'data' used to assess 'performance' is, quite simply, not up to the task.
For example, at Key Stage 2, there are externally marked written assessments of Numeracy, Reading, and Spelling, Punctuation and Grammar (SPAG). I have looked closely at Numeracy, although it would be possible to look at Reading and SPAG in the same way. There are three Numeracy papers, with 100 marks available in total. The grade thresholds vary from year to year. In 2014, they were as follows:
The reading test has a total of 50 marks, with grade boundaries at 15, 24 and 39 marks. Half of all Level 4 results are within 7.5 marks of the adjacent grade. The SPAG test has a total of 70 marks, with grade boundaries at 32, 49 and 61 marks. Half of all Level 4 results are within just 6 marks of the adjacent grade.
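The arithmetic behind boundaries like these can be sketched in a few lines of Python. This is only an illustration, not an official calculation: it assumes that the three thresholds quoted above are the minimum marks for Levels 3, 4 and 5 respectively (the text does not say so explicitly), and the function names are hypothetical.

```python
from bisect import bisect_right

# Thresholds quoted in the text. ASSUMPTION: these are the minimum marks
# needed for Levels 3, 4 and 5 respectively.
READING_THRESHOLDS = [15, 24, 39]
SPAG_THRESHOLDS = [32, 49, 61]
FIRST_LEVEL = 3  # assumed level awarded at the lowest threshold


def level_for(mark, thresholds):
    """Return the level for a raw mark, or None if below the lowest threshold."""
    passed = bisect_right(thresholds, mark)  # number of thresholds reached
    return FIRST_LEVEL + passed - 1 if passed else None


def marks_to_change_level(mark, thresholds):
    """Smallest number of marks, gained or lost, that would change the level."""
    gaps = []
    for threshold in thresholds:
        if mark >= threshold:
            # Marks that would have to be lost to fall below this threshold.
            gaps.append(mark - threshold + 1)
        else:
            # Marks that would have to be gained to reach this threshold.
            gaps.append(threshold - mark)
    return min(gaps)
```

Under these assumptions, a child scoring 38 on the reading test is awarded Level 4, and `marks_to_change_level(38, READING_THRESHOLDS)` shows that a single extra mark would have made it Level 5.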
For an individual child, the result is reported as a single grade (this is what the child takes home at the end of Year 6). The raw result is also recorded as a single number (a fine grade), which is used in data-crunching analyses such as those produced by RAISEonline and the FFT. There is no confidence interval or range of potential outcomes. The result is a simple snapshot which takes no account of any potential noise in the result.
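To make concrete what a reported range might look like in place of a single grade, here is a rough sketch. It reuses the 2014 reading thresholds quoted above, again assuming they are the minimum marks for Levels 3 to 5. The size of the measurement error is a parameter the caller must supply; any value used with it here is purely hypothetical and does not come from any published figure. The function names are likewise invented for illustration.

```python
from bisect import bisect_right

READING_THRESHOLDS = [15, 24, 39]  # ASSUMED minimum marks for Levels 3, 4, 5
MAX_MARK = 50                      # total marks on the reading test
FIRST_LEVEL = 3                    # assumed level at the lowest threshold


def level_label(mark, thresholds=READING_THRESHOLDS):
    """Return a readable label for the level a raw mark falls into."""
    passed = bisect_right(thresholds, mark)  # number of thresholds reached
    if passed == 0:
        return "below Level 3"
    return f"Level {FIRST_LEVEL + passed - 1}"


def plausible_levels(mark, error, thresholds=READING_THRESHOLDS):
    """List every level consistent with `mark` plus or minus `error` marks."""
    low = max(0, mark - error)
    high = min(MAX_MARK, mark + error)
    labels = []
    for m in range(low, high + 1):
        label = level_label(m, thresholds)
        if label not in labels:
            labels.append(label)
    return labels
```

With an illustrative error of 3 marks, `plausible_levels(37, 3)` gives both Level 4 and Level 5, whereas `plausible_levels(31, 3)` gives Level 4 alone. A report in this form would show at a glance which results sit close enough to a boundary to be in doubt.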
Whilst, theoretically at least, levels have been consigned to history, children were still awarded levels at the end of Year 6 in 2014, and the current thinking on future assessments (see commentary by Warwick Mansell here) suggests that similarly flawed 'data' will continue to be used to assess pupils, teachers and schools.
At GCSE, with many more subjects, things are a little harder to summarise. It is possible, however, to look at some grade boundaries to get an idea of how narrow the gaps between adjacent grades are.
An example of a set of grade boundaries can be found on the AQA website. The grade boundaries for English Language on page 7 show the following:
Some grade boundaries are incredibly narrow. In the 2011 PE01 exam, the top grades were separated by a mere 2 marks:
Castles built on sand
The grades an individual pupil receives are simply guesswork. We should not have allowed anyone to build an accountability infrastructure based on this 'data'.
*The main exception to this guesswork is for children working at the very highest and very lowest levels of ability. If a pupil knows everything there is to know about a subject at KS2 or GCSE, their grade is practically guaranteed, barring some catastrophic performance on the day of the exam, since scores beyond 100% are not possible. Likewise, a child who knows none of the information required to answer the questions in a paper will score zero on the test and cannot score less.