Whilst I’m not a big fan of aphorisms in blogs, this one seems appropriate. Last year, I started writing about my many frustrations with the misunderstanding, and subsequent misuse, of the basic techniques used to shed light on the numbers collected in education. I've found a few others who share these frustrations, and it looks like the issues we’ve highlighted are beginning to be taken seriously, and that we’re starting to see some small steps on the way to sensible use of data in schools.
Some background, then. I wrote a piece called RAISEonline is contemptible rubbish in which I discussed one of the fairly simple techniques used to compare numbers, that of tests of statistical significance. I suggest you read it if you haven’t done so already.
Following my piece about RAISEonline, physicist Philip Moriarty wrote Lies, Damned Lies and Ofsted Pseudostatistics. We were then invited to meet statisticians from Ofsted and the DfE. I wrote this post (Hammering Nails in RAISEonline's Coffin), explaining in a little more detail why the tests of statistical significance used in RAISEonline should be abandoned, and then this post (Meeting Ofsted to discuss data) after we had met Sean Harford and the Nameless Ones.
In the post about meeting Ofsted I said the following:
“I can’t say too much, other than to say that I expect to see changes in the future. The weight of children in a class represents that class and nothing else, and can’t be said to be drawn from a wider population, because schools are located within their geographical region and admit children according to their own criteria.”
Ch-ch-ch-ch-Changes
On the 16th July, the DfE released some research on the forthcoming Reception Baseline Assessments. This is from page 4 of the Reception baseline research: results of a randomised controlled trial report:
A really simple explanation why
Once again, I’m going to patronise you a little, I'm afraid. If you understand the significance of the sentence I've highlighted above, good for you. Many people will not, however, so I’ll make it clear.
- The DfE research report above showed that, if you were to assume that a group of children in a particular school were a representative sample of the whole population of school children, you would have to conclude that the two groups they had developed and tested were different in a way which a standard test of statistical significance would suggest was not due to chance.
- This is what RAISEonline - the main data set used by Ofsted, schools and the DfE to assess schools' effectiveness - does when it colours data blue and green, and when it constructs confidence intervals to test whether test results are statistically significant. It suggests that items coloured blue or green are unlikely to be due to chance, and represent meaningful differences between the school and the national population. Essentially, green is ‘Yay!’ and blue is ‘Boo!’.
- The DfE research report above states categorically that children are “clustered within schools”. This means, as I’ve argued previously, that each school population represents itself only, and cannot be assumed to be drawn independently from a national population.
- The report then goes on to conclude:
- If the DfE has done this in this research report, it should apply the same thinking to all data which is clustered in schools. RAISEonline does not do this. It uses statistical significance incorrectly, in a way which this report (quite rightly) specifically rejects.
- Green and blue colouring should be removed from RAISEonline, and tests of statistical significance - which require independent and identically distributed data - should not be used unless the data come from random samples of a population.
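The argument in the list above can be demonstrated with a short simulation. This is a sketch, not RAISEonline's actual calculation, and all the parameters (school and pupil standard deviations, class size) are illustrative assumptions. It builds a national population in which no school is genuinely different from any other, but in which pupils are clustered within schools via a shared school-level effect. A naive significance test, which treats each class as an independent random sample of pupils, then flags far more than the nominal 5% of schools:

```python
# Sketch (not RAISEonline's actual method; all parameters are illustrative)
# of why clustering breaks naive significance tests. Pupils are clustered
# within schools: each school has its own random effect, so a class is not
# an independent random sample from the national distribution. We run a
# standard z-test on each class as if it were, and count the flags.
import random
import math

random.seed(1)

NATIONAL_MEAN = 100.0
SCHOOL_SD = 5.0    # between-school variation (the clustering)
PUPIL_SD = 10.0    # within-school variation between pupils
CLASS_SIZE = 30
N_SCHOOLS = 2000

false_positives = 0
for _ in range(N_SCHOOLS):
    # Every pupil in this school shares the same school-level effect.
    school_effect = random.gauss(0, SCHOOL_SD)
    scores = [NATIONAL_MEAN + school_effect + random.gauss(0, PUPIL_SD)
              for _ in range(CLASS_SIZE)]
    class_mean = sum(scores) / CLASS_SIZE

    # Naive z-test: treats the class as 30 independent draws from the
    # national distribution, so the standard error shrinks with class
    # size and ignores the shared school effect entirely.
    total_sd = math.sqrt(SCHOOL_SD**2 + PUPIL_SD**2)
    z = (class_mean - NATIONAL_MEAN) / (total_sd / math.sqrt(CLASS_SIZE))
    if abs(z) > 1.96:  # the usual two-sided 5% threshold
        false_positives += 1

# No school in this simulation is genuinely better or worse than the
# national distribution, yet far more than 5% of classes are flagged.
print(f"Flagged as 'significant': {false_positives / N_SCHOOLS:.0%}")
```

With these (assumed) parameters, the shared school effect dominates the naive standard error, so roughly half the schools are flagged as 'significantly' different even though, by construction, none of them are.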
James Pembroke, who has been a stalwart in raising the issues within RAISEonline, has also written about this in his blog The Gift. As he says, “This is something that Jack Marwood, myself and others have been trying to get across for a while - that there isn't a cohort of pupils in England (or maybe anywhere for that matter) that can be considered to be an independent random sample. Not one.”
I look forward to seeing what happens next.