Assessments are imperfect; here’s how to make yours as reliable as possible

By Joshua Perry

We’ve asked Joshua Perry, education technology expert and entrepreneur, to write a series of blogs about analytics and assessment. The first instalment examines why we bother with analysis in the first place, the second covers analytics for classroom teachers, and the third analytics for MAT leaders.

This is the fourth and final instalment, which looks at the general ways in which educators can get the most out of their assessment data. Joshua is on Twitter as @bringmoredata.

Many people who’ve worked with school assessment data will recognise the following journey. Let’s call it the four stages of assessment expertise:

STAGE 1 – ARRRG! What do all these dizzyingly complex terms mean and how am I ever going to remember the difference between ARE, EYFS, WTS, A8, P8 and EBACC?? Don’t even think about asking me to explain the difference between standardised scores and the DfE’s scaled scores!

STAGE 2 – PHEW! I now just about know the difference between all the dizzyingly complex terms. Now I just need to work out how to do some actual useful analysis.

STAGE 3 – WOOP! I’ve gathered data and generated some super-cool analysis! I’ve GOT this!

STAGE 4 – ARRRRGG! Now I understand how assessments work in the real world, I see uncertainty and imprecision everywhere!

That’s certainly a version of my own story, anyway. And honestly, the main simplification this glosses over is that you never completely move on from STAGE 1 – I still frequently find myself googling some niche policy point or other. And I once oversaw systems and data for a Multi Academy Trust and then founded a schools data platform… so if I’m frequently bamboozled, I’m going to be brave and guess that sometimes you are too.

Anyway, apart from acknowledging that assessment is a complicated area, the real point here is to emphasise the horrible sinking feeling you get when you arrive at STAGE 4. In my first year working with school data, I confess I spent minimal time thinking about how grades come into existence, and whether or not they’re reliable. I suspect this is partly because I’ve never been based in a school (though being a chair of governors helped); but it’s also partly because in other sectors I’ve worked in, numbers could often be taken at face value. For example, I once founded a publishing company. Our main metric was book sales. Nobody ever had a philosophical debate about whether a transaction was achieved, or just “working towards” being a sale.

Then, when I did start thinking about the assessments themselves, I didn’t like what I discovered. It turned out that teacher assessed grades were the SUBJECTIVE VIEW OF A TEACHER! Summative assessments were OFTEN WRITTEN BY THE VERY SAME TEACHERS WHO USED THEM! Schools were reusing assessments, so IT WAS EASY TO TEACH TO THE TEST! KS1 Phonics results have a DEEPLY WEIRD DISTRIBUTION (particularly when you know that the pass mark is always 32)! I could go on.

So, if assessments can’t always be trusted, shouldn’t we just give up on school data entirely?

I don’t think so, but I do think we need to reflect on the quality of our data before we do any analysis whatsoever. I also think we need to make the quest for reliability central to any assessment approach. 

The problem is particularly acute for summative assessment, since the tests are more complex and the stakes are higher, so that’s where I’ll focus for much of the rest of this blog. Here are my top tips to avoid being overly impacted by unreliable summative assessment data:

  1. Have an assessment policy. I banged on about this in the first blog in the series (which contains more details on what to put in your policy), and I’m going to keep doing so until every school has one! Without a policy, how can you expect different teachers (and different schools, if you’re a MAT) to approach assessments in the same way?
  2. Summative assessments can be formative, but formative assessments should never be summative. Summative assessments contain tonnes of formative value at the question and strand level (that’s a strength of Renaissance’s Star Assessments), but the tendency to extract summative insights from formative assessments should be resisted. In theory you could ask every student in every school in a MAT to sit a trust-wide vocab test every week, then analyse those results centrally… but at what cost? There’s no basis for thinking that the same set of formative questions suits every class – lower sets may not be able to access the same questions as higher sets, for example. Moreover, suddenly teachers will be focusing on doing well in that test as if it had high stakes attached, because they know somebody central will be seeing the results. In doing so they may ignore gaps from previous weeks, even though their professional judgment tells them the catch-up work should be the greater priority. On which note…
  3. Don’t give teachers incentives to distort data. Look, if you want data to be reliable, you need the teachers who are managing the testing process to be on board with your policy. If they feel their job is at risk if their students do badly, you’re creating a powerful perverse incentive to skew results. If you give a teacher a dressing down when they undershoot expectation, you’re not analysing the reason for the poor performance; you’re just making them fear the process. 
  4. Create a collaborative culture. The best school data cultures I’ve seen have been thoughtful above all else. While I was at Ark, schools held data collaboration sessions where leaders from different schools came together to discuss their own and each other’s data. The priority was insight and discovery, not winning and losing. Reliability can be part of these conversations as well: e.g. was the test pitched at the right level, and did questions perform as desired?
  5. Prioritise well-designed tests over subjective judgments. People sometimes assume that test-boosters like me don’t trust teachers to make professional judgments. I think that’s a straw man. Rather, I think any practitioner in any sector should want to be able to refer to a well-designed and objective test if given the option: doctors don’t disregard blood tests because they think they know why a patient is yellow! That said, clearly the test isn’t the end of the analysis: professionals interrogate results and overlay other information before deciding what actions to take. So I agree that tests shouldn’t be seen as the only piece of relevant data on how a student is performing, but I do get worried when schools ask teachers to submit summative judgments without reference to a decent assessment. Mind you, this all presupposes that the test itself is well-written, and that’s no easy task. So, if you’re responsible for creating good assessments, you may want to consider CPD from specialists such as Evidence Based Education (EBE).  
  6. Standardise wherever possible. Of course, this can involve using standardised assessments for subjects where these are commercially available (mostly in Maths and English). However, new and exciting ways of standardising in other subjects are emerging, and MATs are leading the way. Many MATs now have 1,000+ students in each cohort, and that’s a pretty good sample size for standardisation. So, increasingly MATs are setting common assessments across their schools then standardising the results in-house. My previous blog, Analytics for MAT Leaders, has information on how Ark handle this process. 
  7. Consider using moving averages. Assessment data can be messy; individual grades frequently bounce around between checkpoints. It’s rarely helpful to dwell on these variations – they may just be a symptom of the imperfect reliability of the underlying assessment. Instead, by combining results from multiple data points, you may get a more reliable picture of performance. The average grade for a student across multiple checkpoints (e.g. the last three assessment windows) will fluctuate less than individual results, and so changes will be more meaningful.
  8. Think about what you’re measuring. I could hardly write a whole blog about assessment reliability without referring to Daisy Christodoulou – her book Making Good Progress? is a great starting point if you want to understand assessment in more detail. This blog – which compares education with marathon running – is also helpful for thinking through what you’re measuring, and why: just as you wouldn’t train for a marathon by only running marathons, it wouldn’t make sense to prepare for a GCSE exclusively by taking past papers. A good summative assessment should test what the student could know (i.e. what has been taught); and it should also bring in questions based on curriculum taught prior to the current term or year. That calls for careful question selection.
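To make the standardisation idea in tip 6 concrete, here’s a minimal sketch of the usual approach: convert each raw mark into a z-score against the whole cohort, then rescale to a familiar standardised scale (mean 100, standard deviation 15 is a common convention). The function name, the example marks, and the target scale are all illustrative assumptions, not a description of how Ark or any particular MAT actually does it.

```python
from statistics import mean, stdev

def standardise(raw_scores, target_mean=100, target_sd=15):
    """Rescale raw marks so the cohort has the target mean and SD.

    Assumes all students sat the same common assessment, so their
    raw marks are directly comparable before standardisation.
    """
    m = mean(raw_scores)
    sd = stdev(raw_scores)
    # z-score each mark, then map onto the target scale
    return [round(target_mean + target_sd * (x - m) / sd) for x in raw_scores]

# Hypothetical raw marks for one (very small) cohort
cohort = [34, 41, 27, 45, 38, 30, 44, 36]
print(standardise(cohort))
```

In practice a MAT-wide cohort of 1,000+ students gives a far more stable mean and spread than this toy example, which is exactly why the sample size matters for in-house standardisation.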
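And the moving average from tip 7 can be sketched just as simply: average each checkpoint with the results that came before it, using a window of the last three assessment windows. The checkpoint scores below are made up for illustration.

```python
def moving_average(scores, window=3):
    """Smooth a series of checkpoint scores.

    Each value is the mean of the current score and up to
    (window - 1) preceding checkpoints, so early checkpoints
    use whatever history is available.
    """
    result = []
    for i in range(len(scores)):
        recent = scores[max(0, i - window + 1): i + 1]
        result.append(round(sum(recent) / len(recent), 1))
    return result

# Hypothetical scores for one student across six checkpoints
checkpoints = [52, 61, 48, 58, 63, 55]
print(moving_average(checkpoints))  # → [52.0, 56.5, 53.7, 55.7, 56.3, 58.7]
```

Notice how the smoothed series drifts gently upwards while the raw scores jump around by ten marks or more – the trend is easier to trust than any single checkpoint.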

Finally, I’d suggest that it’s actually somewhat liberating to remember that all assessment data is unreliable. In the words of Professor Stuart Kime from EBE:

“Because it is a proxy for something unseen, and because interpretation is often part of making sense of the information derived from an assessment, error is always present in some form or other.”

– Stuart Kime, ‘Four Pillars of Assessment: Reliability’

I like to use the analogy of opinion polling when thinking about assessment reliability. We understand that polls don’t precisely reflect actual election results (or whatever it is they’re examining); they’re an approximation based on a partial sample. Well, an assessment has a similar relationship to its subject: it can’t ever be a precise representation of a student’s knowledge, because it is only sampling a small fraction of the domain. That may sound disheartening, but it doesn’t need to be: you can still extrapolate a lot from a well-designed assessment. The point is simply that recognising the limitations of your dataset is the best starting point for any analysis. So instead of getting depressed on arrival at Stage 4 of your journey towards assessment expertise, let it free you! Rather than agonising about a lack of reliability, embrace that reality and build the best possible system you can. Your caution and lack of certainty may end up being the best thing about the process you put in place.

Joshua’s blog series can be read here. To see how we’re supporting students and teachers during school closures, click here. You can follow Joshua on Twitter at @bringmoredata and Renaissance at @RenLearnUK.
