May 9, 2024
History, literature, and legend are filled with memorable quests and searches. Juan Ponce de León sought the Fountain of Youth, Don Quixote strove to live a chivalrous life, Captain Ahab pursued the white whale, and generations of seekers have pursued the Holy Grail.
We typically use the metaphor of a “holy grail” to describe an ultimate achievement: one so aspirational that it is nearly impossible to accomplish. In searching for a holy grail, the metaphor suggests, you are likely pursuing something you may never find, and the search itself could ultimately prove futile.
When it comes to finding a single, ideal K‒12 assessment tool, some school and district leaders continue just such a futile quest. But as I explain in my new Masterclass video, a recent experience reminded me that assessment’s holy grail does indeed exist, provided that you understand what you’re looking for.
Answering essential questions about student learning
One of my favorite thought leaders is Dr. Dylan Wiliam, co-author of the 1998 study Inside the Black Box, whose findings have driven much of our current interest in formative assessment strategies. I’ve been blessed to engage with Dr. Wiliam on multiple occasions, and I find him to be both an absolutely brilliant mind and the most voracious consumer of educational research I’ve ever encountered.
During the 2023 ASU+GSV Summit, Wiliam was part of a panel discussion on “Meeting Kids Where They Are: The Future of Assessments and Testing.” A variety of topics were covered during this session, and some profound insights were offered. I’d like to review and amplify one of those insights here.
After speaking about the potential of various forms of assessment, Wiliam made the following comment: “There is no correct solution…there are only tradeoffs. And I want us to talk about the tradeoffs so that they are made explicitly rather than ending up with unintended consequences further down the line.”
Now, this was an unscripted comment, and its spontaneity did not fully capture the nuances of Wiliam’s point. Slightly restated, I believe he was offering the following key insight, one that some school and district leaders fail to understand:
There is no perfect solution—no single assessment tool that will do everything we need. There are only tradeoffs, because each form of assessment has strengths and weaknesses, and advantages and disadvantages. It’s important that we recognize these so that any tradeoffs are made explicitly and are fully understood, so we do not end up with unintended consequences for our students later on.
What does this look like in practice?
Performance tasks: The power of open-ended assessment items
As an example, consider the ongoing interest in and passion for open-ended assessment items such as performance tasks. During the ASU+GSV panel discussion, this category was referred to as “rich item types,” a name that reflects both the range of skills these items can assess and their ability to engage students in deeper thinking.
For illustrative purposes, let’s take a look at the following performance task created by the New Standards Project, and let’s consider both its advantages and disadvantages:
Orbit 1 Space Station Theater: A Performance Task
Orbit 1 Space Station is currently in the final stages of planning. It will be the first large-scale space station capable of hosting sizeable groups. The station is round, and a central floor/deck is being reserved for use as a theater and meeting space. The space reserved for this purpose is a perfect circle with a radius of 75 feet. It is to be designed as a “theater in the round” with a center stage, also round, that is to have a minimum diameter of 20 feet.
The desire is to maximize seating while meeting all safety requirements. Your goal is to design a theater/meeting space with as many seats as possible while conforming to the following guidelines:
- The minimum seat dimensions are 24 inches wide by 16 inches deep.
- Each seat must have a minimum of 12 inches of leg room in front of it.
- The theater must include 4 central, equally spaced aisles no less than 5 feet in width.
- For safety purposes, there must be no more than 10 seats in any row without a secondary access aisle, no less than 2 feet in width.
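Before weighing this task’s merits, it may help to see the mathematics it demands. The following back-of-the-envelope sketch is my own illustration, not part of the original task or its scoring materials; it makes simplifying assumptions (every row is a full circle of seats interrupted only by the aisles, and each row’s arc is measured at the front edge of its seats):

```python
import math

# Rough capacity estimate for the Orbit 1 theater (illustrative only).
INNER_R = 10.0      # stage radius: 20-ft minimum diameter
OUTER_R = 75.0      # theater radius
SEAT_W = 24 / 12    # seat width, feet
SEAT_D = 16 / 12    # seat depth, feet
LEGROOM = 12 / 12   # legroom in front of each seat, feet
ROW_DEPTH = SEAT_D + LEGROOM  # 28 in per row
MAIN_AISLE_W = 5.0      # each of the 4 central aisles
ACCESS_AISLE_W = 2.0    # secondary access aisle
MAX_RUN = 10            # max seats between access aisles

def seats_in_section(arc_len: float) -> int:
    """Greedily fill one section (between two main aisles) with runs
    of up to MAX_RUN seats separated by secondary access aisles."""
    seats, remaining = 0, arc_len
    while True:
        run = min(MAX_RUN, int(remaining // SEAT_W))
        if run <= 0:
            break
        seats += run
        remaining -= run * SEAT_W
        if run < MAX_RUN or remaining < ACCESS_AISLE_W + SEAT_W:
            break
        remaining -= ACCESS_AISLE_W  # carve out a secondary aisle
    return seats

n_rows = int((OUTER_R - INNER_R) // ROW_DEPTH)
total = 0
for i in range(n_rows):
    r = INNER_R + i * ROW_DEPTH + LEGROOM  # front edge of row i's seats
    usable_arc = 2 * math.pi * r - 4 * MAIN_AISLE_W
    total += 4 * seats_in_section(usable_arc / 4)

print(f"Estimate: {n_rows} rows, roughly {total} seats")
```

Even this crude estimate requires coordinating circumference, unit conversion, and several interacting constraints at once, which is precisely what makes the item “rich.”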
Often, conversations about performance tasks such as this one focus on their advantages: They’re rigorous, they’re engaging for students, they focus on real-world applications, etc. But remember the critical insight from Wiliam: Each form of assessment has strengths and weaknesses, and advantages and disadvantages.
Discussing this dynamic during the presentation, Wiliam noted the following:
When you have rich item types, what’s crucial is whether the kid likes the item. Because, when you have rich items, they are longer. Therefore, you don’t get to ask but so many questions.
What we find is that when we look at performance assessments in science, for example, how good the kid is at science, who scored the items, how hard the items were—these things are less important than how lucky the student was in the particular set of performance tasks they were asked to complete. [Did they like the particular topic(s) through which the skills were applied?]
So, when you build in a lot of performance questions, you build in a massive amount of invisible unreliability, and it becomes a lottery.
To put this another way, students who have an interest in space, theaters, and/or architecture will have “won the lottery” with the Orbit 1 Space Station task, regardless of their math ability. Their natural interest in this particular setting for the application of the math concepts would create an “invisible unreliability.”
This might result in these students scoring higher on this item than students with an equal understanding of the concepts but with little or no interest in space stations or theaters.
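To make this “invisible unreliability” concrete, here is a small simulation. It is entirely my own illustration; the score model and effect sizes are assumed, not drawn from Wiliam’s research:

```python
import random

random.seed(1)

TRUE_ABILITY = 70     # both students understand the math equally well
INTEREST_EFFECT = 10  # assumed bonus/penalty for (dis)liking the topic

def observed_score(likes_topic: bool) -> float:
    """Assumed model: observed score = true ability
    + topic-interest effect + ordinary measurement noise."""
    bonus = INTEREST_EFFECT if likes_topic else -INTEREST_EFFECT
    return TRUE_ABILITY + bonus + random.gauss(0, 5)

space_fan = observed_score(likes_topic=True)   # "won the lottery"
uninterested = observed_score(likes_topic=False)
print(f"Space enthusiast: {space_fan:.0f} | Equally able, uninterested peer: {uninterested:.0f}")
```

Under these assumed numbers, two students with identical understanding land around 20 points apart on average, purely because of the topic draw.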
Performance tasks: Advantages and disadvantages
Does this mean that performance tasks should never be used in K‒12 assessment? Certainly not.
Performance tasks’ advantage is that they assess the application of concepts at a deep level. This is something they do exceedingly well. But we also must acknowledge their disadvantages, including:
- The amount of time they take to complete.
- Their potential for “invisible unreliability.”
These are the “tradeoffs” that Wiliam wants us to thoughtfully consider, “rather than ending up with unintended consequences further down the line.”
Wiliam does note that “the evidence is that once you get to about 6 or 7 [open-ended] tasks, then you really are assessing that student’s ability in science rather than just how lucky they were.” This is because the wider variety of tasks has washed out the unreliability that would have been present with only one task.
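The psychometrics behind Wiliam’s 6-or-7 figure can be sketched with the Spearman-Brown prophecy formula from classical test theory, which projects how reliability grows as comparable tasks are added. The single-task reliability of 0.35 below is an assumed value for illustration, not a figure from the talk:

```python
# Spearman-Brown prophecy formula: projected reliability when an
# assessment grows from 1 task to n comparable tasks.
def spearman_brown(r_single: float, n_tasks: int) -> float:
    return n_tasks * r_single / (1 + (n_tasks - 1) * r_single)

for n in range(1, 8):
    print(f"{n} task(s): projected reliability ≈ {spearman_brown(0.35, n):.2f}")
```

Under this assumption, projected reliability climbs from 0.35 with one task to about 0.79 with seven, which is consistent with Wiliam’s observation that the luck of the draw washes out only after a handful of tasks.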
Again, this is a tradeoff that must be considered.
Multiple-choice assessment items: Prioritizing efficiency and reliability
On the other end of the scale from performance tasks, we find the lowly multiple-choice item. This item type’s disadvantages are often noted, primarily that it generally cannot assess depth or rigor.
Considering multiple-choice items through the lens of Webb’s Depth of Knowledge (DOK), we recognize that they can consistently be written to assess DOK Levels 1 and 2 but, with rare exceptions, they cannot assess DOK Levels 3 or 4. For this reason, in some circles, they are summarily dismissed. But we should also note their advantages.
In comparison to most other item types, multiple-choice items are very short. Additionally, they have been researched extensively, and assessment authors know how to write them so that they are highly reliable.
In the time it would take students to complete the Orbit 1 Space Station performance task, they could easily have responded to numerous multiple-choice items. Might the information from these items provide us with more insight into students’ learning than a single performance task?
Again, this is something to consider when we make our tradeoffs.
Comprehensive assessment systems: The right tool at the right time
Despite the efficiency and documented reliability of multiple-choice items, I’m not suggesting that we rely solely upon them to meet our assessment needs. Similarly, despite their advantages in rigor and application, we would not want to rely solely on performance tasks.
Looking beyond education, a comparison of different types of hammers can be helpful here. Take a moment to visualize:
- A tack hammer.
- A standard hammer.
- A sledgehammer.
- A jackhammer.
They all come from the same family, but they vary widely in application and, to borrow an assessment term, in grain size. You’d never use a sledgehammer to do a tack hammer’s job, or vice versa. It’s not about one tool being “better” or “worse” than another; it’s about using different tools for different purposes, in the hands of trained users who know which tool to use when.
Similarly, educators should be supported in developing their assessment literacy so that they know which assessment tools to use when. This remains a challenge in many preparation programs. As Dr. Rick Stiggins remarks, “The preservice training provided to teachers on assessment is akin to training someone to become a physician without teaching them what tests to order or how to interpret the results.”
I’d also note that the settings in which different types of assessment are employed can impact their perceived efficacy. Consider, again, the Orbit 1 Space Station task:
- Something of this length would never be tolerated in a universal screening tool, nor would it be able to reliably provide the necessary normative information (such as percentile ranks or student growth percentiles) that we expect from seasonal screening data.
- However, at the classroom level and when used close to instruction, this performance task would be considered ideal, providing students with the opportunity to deeply engage with and apply mathematical concepts around area, diameter, etc.
My point is that the very same assessment tool, without the change of a single character, can be perceived as either ideal or inappropriate depending on the setting in which it is used.
The power of a comprehensive assessment system
So, if there’s no ideal assessment tool, then what should we do when faced with the many assessment needs that schools and districts have? The answer is simple: Stop looking for a single tool. Rather, look for a comprehensive assessment system.
It’s not about having one specialized tool—a single hammer. It’s about having a collection of varying tools that are purpose-built to meet the needs you have throughout the year.
That’s exactly what we’ve been assembling at Renaissance: a comprehensive assessment system that includes a variety of tools, varying in complexity and grain size, designed to work together to answer your essential questions. Our system includes:
- Universal screening and progress monitoring with Star Assessments and FastBridge
- Diagnostic assessment with the unique Star Phonics
- Classroom formative, benchmark, and summative assessment with DnA
- Non-academic assessment with Renaissance Fundamentals
As you think about your own school or district and your needs throughout the year, I encourage you to ask whether you have the right assessments in place to answer all of your questions about student learning. And for more insights on key topics in K‒12 assessment, I invite you to explore my full Masterclass series.
Learn more
To explore Renaissance’s comprehensive assessment tools for literacy, math, and non-academic factors, please reach out.