Moving on from rigor

This is a repost from Grading for Growth. I post there every other Monday (my colleague David Clark does the other Mondays) and usually repost here the next day. This one is a follow-up to a post from last week, which David and I co-authored, in which we argued that the concept of academic rigor has no inherent meaning and therefore we need to find a better term to describe whatever it is we are talking about when we say "rigor". This post starts where that one ends.

Later this week I'll be posting a follow-up to this follow-up that describes some more of my thoughts about rigor.


Last week we dove into the idea of rigor and decided that it’s not a useful term to describe what we’re looking for in learner assessments, because it has so many potential definitions that it has no definition at all. Instead, it tends to become a pathway for injecting our biases into our teaching. We ended that post on a cliffhanger: We have a proposal for a replacement term.

That term is validity.

As David and I have asked others about “rigor”, and as I’ve examined my own beliefs about it, it has become clear that despite the issues with that term, there is some kind of shared conception of academic quality underneath the bravado of “rigorous academics”. It’s hard to get a fix on it, but it seems like what we really want when we talk about “rigor” is to be able to trust the outcomes of our grades. When we give a high grade to a student, we want this to mean that the student actually learned what the course said they would learn, and learned it “well” (whatever that means). This is the primary concern behind grade inflation, behind courses with “low standards”, and so on: that the grade assigned doesn’t accurately reflect the learning that took place, or didn’t take place.

Well, that’s what validity means. And the benefit of using “validity” to describe academic environments over “rigor” is that validity is a well-understood methodological concept from social science research that is ubiquitous in that field, and even has a huge body of research just studying itself. It’s everything “rigor” is not. (One might say it’s a more rigorous approach to rigor.)

Grading as research

Every assessment we give our students is a mini-experiment whose purpose is to collect data on the “research question” of whether they learned something. Like all experiments, they can be designed well or poorly. The “research question” has to be focused and clear — I can’t give an assessment on whether my students “learned about discrete mathematics”. But I can design an assessment about whether they learned about how to solve recurrence relations or whether they understand how to construct truth tables.

Also like experiments, assessments are subject to two kinds of error: Type I error, where we encounter a false positive (the result of the assessment causes us to think that students really learned when in fact they didn’t), and Type II error, where we encounter a false negative (it looks like students didn’t learn, but they actually did). I’ve written before about how one-and-done testing greatly amplifies the probability of each kind of error, while non-traditional grading reduces it1.
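To make the false-negative side of this concrete, here is a minimal sketch of a toy model I’m making up purely for illustration: assume a student who genuinely learned the material still has some fixed chance of bombing any single attempt because of confounds (illness, a bad night, time pressure), and that attempts are independent. Neither assumption is realistic, but even this crude model shows how quickly reassessment shrinks the chance of a false negative compared to a one-and-done exam.

```python
# Toy model (my own illustration, not data): false-negative ("Type II") risk
# for a student who truly learned the material, as a function of how many
# attempts the grading system allows. Assumes a fixed per-attempt miss
# probability and independent attempts; both are simplifications.

def false_negative_rate(per_attempt_miss: float, attempts: int) -> float:
    """Probability that a student who really learned it misses on every attempt."""
    return per_attempt_miss ** attempts

per_attempt_miss = 0.20  # hypothetical: a 20% chance a prepared student has a bad day

for attempts in (1, 2, 3):
    rate = false_negative_rate(per_attempt_miss, attempts)
    print(f"{attempts} attempt(s): false-negative rate = {rate:.1%}")

# 1 attempt(s): false-negative rate = 20.0%
# 2 attempt(s): false-negative rate = 4.0%
# 3 attempt(s): false-negative rate = 0.8%
```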

In either case, we have issues with validity. We may get data from these “experiments”, but the data lead to false conclusions. This is what all of us teaching in higher education, whether or not we’re converts to alternative grading, want to avoid. An “academically rigorous class” is one where true learning, or the lack of it, is faithfully indicated by the grades that are assigned. Having valid assessments is therefore good common ground for discussions about grading, while talking about “rigor” only seems divisive.

The technical meaning of validity is wide-ranging and involves numerous flavors. I’m going to focus on two of those here.

Grading and construct validity

Construct validity is “the degree to which a test measures what it claims, or purports, to be measuring.” That word “test” in context means any kind of measurement; for us, it might literally be an in-class test or exam. But it really means assessment. So, do the assessments we give actually measure what we claim they measure? And how well?

But we need to back up first. What exactly are we claiming to measure when we give assessments?

I find it difficult to answer this question without using slippery words like “understanding” or “appreciation” or “knowledge”. But I still think this is correct: When we assess our learners, we want to see if they have “really learned” or “truly understood” something. Knowledge of how to solve recurrence relations; appreciation of how recursion can be used to model patterns; and so on. It’s exactly the opposite of what I preach about learning objectives. And I think this is OK, and one reason why teaching is hard. Our “construct” in learning is an abstraction like “knowledge” or “understanding”, while our assessments are how we concretely measure that abstraction. We write clear, measurable learning objectives like “I can solve a recurrence relation” as an attempt to bridge the gap, to connect the test to the construct.
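To make that bridge concrete, here is the kind of item such an objective might point to. This is a made-up example, not one pulled from my actual assessments: solve the recurrence $a_n = 5a_{n-1} - 6a_{n-2}$ with $a_0 = 1$ and $a_1 = 1$. Using the characteristic root method, the work looks like this:

$$r^2 - 5r + 6 = 0 \implies r = 2 \text{ or } r = 3 \implies a_n = c_1 \cdot 2^n + c_2 \cdot 3^n$$

$$a_0 = 1,\ a_1 = 1 \implies c_1 = 2,\ c_2 = -1 \implies a_n = 2 \cdot 2^n - 3^n$$

A student’s written work on an item like this gives us something concrete to hold up against the abstract construct: can they, in fact, solve recurrence relations?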

In fact, clear and measurable learning objectives are an essential piece of construct validity. If we are trying to measure a thing, then we have to clearly state what the thing is that we are trying to measure. If we write up assessments or assignments without linking them to clear criteria, then we’re attempting to access a pure abstraction — “know x”, “understand y”, etc. — and this is again ripe for bias and abuse.

There are several other ways to foul up the construct validity of an assessment or a grading scheme, including but not limited to:

  • Bias in the assessment itself. There’s an exercise in the Stewart Calculus book that reads: “Jason leaves Detroit at 2:00 PM and drives at a constant speed west along I-94. He passes Ann Arbor, 40 miles from Detroit, at 2:50 PM. Express the distance traveled in terms of the time elapsed.” I gave this exercise as homework once, and one of my international students replied, “It’s not possible to answer this question because we don’t know how fast Ann is traveling.” That’s a perfectly reasonable answer from someone who doesn’t realize “Ann Arbor” is a city, not a person. This exercise was intended to measure learners’ understanding of related rates problems, but what it really measured was their knowledge of Michigan geography. It had poor construct validity, in other words, because of the bias toward American citizens baked into the question.
  • Defining the criteria of the assessment too narrowly. This happens a lot when learning objectives are poorly defined. Overly narrow objectives can exclude a lot of relevant information, for example if we define success in solving recurrence relations (generally) as “I can use the characteristic root method to solve a linear homogeneous second-order recurrence relation”. It can also happen if the criteria are not aimed properly, for example testing for knowledge of solving recurrence relations by asking students to do something only tangentially related to this construct.
  • Presence of confounding variables. This might be the most prominent of all threats and the biggest issue with one-and-done testing. Using a one-and-done timed exam as an experiment to see if a student learned a topic is vulnerable to a vast number of confounding variables: physical health, mental health, whether the learner brought a calculator, whether the buses were running on time, whether the learner is a native English speaker or not, and on and on. The time constraint amplifies all these issues.

So, an assessment has good construct validity if it accurately measures the construct of learning/knowledge/understanding. And a “rigorous” course is one where the assessments, and the grading scheme itself, have good construct validity — the assessments actually measure “real learning” and aren’t just fluff or busy work. Likewise a “non-rigorous” course is one where you just can’t trust or believe the results of assessments, for example an abstract algebra class where proofs are never assessed and the course grade is just based on participation; or the proofs are assessed but only using word count, or some other means that don’t measure “real understanding”.

Grading and criterion validity

Criterion validity, by contrast, is “the extent to which an operationalization of a construct, such as a test, relates to, or predicts, a theoretical representation of the construct—the criterion.” Or as this book puts it, “[c]riterion validity compares responses to future performance or to those obtained from other, more well-established surveys.”

Criterion validity can be broken into further subcategories, but I’d like to dwell on the overall concept of predictive accuracy for now. An assessment that has good criterion validity is one that accurately predicts future performance under “real” conditions. I go back to my colleague’s question that I wrote about last month. That colleague was questioning whether the use of specifications grading in abstract algebra might produce students who can’t do work on their own without significant help. I think this is a question about the criterion validity of my assessments and my system. Sure, students can get good grades now on their proofs; but what about when they get into graduate school?

It was and still is a fair question. I’d turn it around and say we should be asking this question about every assessment and every syllabus we see. So for example, you might have a grading system where the grades are based on three tests, a final, and some timed quizzes (and there’s no retaking of any of these). Students and profs might like this because it’s uncomplicated. But I would have serious questions about criterion validity. A student earns a 92% on the final exam; great, but does this actually predict anything about future performance? A student got a “B” in the class by earning 80% on everything, and never getting a single exam/quiz problem totally right; does this accurately predict “above average” success in other contexts, like graduate school or a job?
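If we actually wanted to check, the standard move in the measurement literature is to correlate the assessment result with some later criterion measure, such as performance in the next course in the sequence. Here is a minimal sketch of what that might look like; the numbers and variable names are invented for illustration, not data from my courses.

```python
# Hypothetical sketch: a crude first look at predictive (criterion) validity is the
# correlation between an assessment's results and a later "criterion" measure.
# All values below are invented for illustration.
from statistics import correlation  # available in Python 3.10+

# Final-exam percentages from a hypothetical intro course...
intro_exam = [92, 80, 74, 88, 65, 95, 71, 83]
# ...and the same (hypothetical) students' grades in the follow-on course.
next_course = [75, 81, 70, 90, 60, 88, 74, 68]

r = correlation(intro_exam, next_course)
print(f"Pearson r between intro exam and follow-on performance: {r:.2f}")

# A strong positive r would be modest evidence of predictive validity;
# an r near zero would suggest the exam grade tells us little about what comes next.
```

A real criterion-validity study would be much more careful than this, but the basic question is the same: does the grade carry any information about what happens next?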

An assessment has good criterion validity if it does predict future results well. And we think of a “rigorous” course as one where most or all of the assessments, and the grading system itself, have good criterion validity. A “non-rigorous” course would be one where sure, a student can get a good grade now, but it doesn’t translate into success in graduate school, on the job, or even the next course in the sequence.

Another way to view criterion validity is whether the assessment compares well to the same construct being measured by a “gold standard” assessment — one that is known to have good criterion validity already. The “gold standard” I hear kicked around most often in my discipline (mathematics) is the oral exam. Sure, your students can get good grades if they are allowed infinite reattempts, but if I sit them down and drill them in person, how will they hold up? I’m not sure we all mean the same thing by an “oral exam” (are we actually assessing student learning or just trying to make them squirm?) but again, I think it’s a question worth asking for all assessments and systems: If you took students who succeeded in your assessment and grading scheme and then submitted them to fair, unbiased oral examinations, how would they do?

Where we’re going with this

Let me put my cards on the table here. I honestly believe with all objectivity that validity rather than rigor is the correct framework for thinking about assessments and grading. Do our assessments and grading schemes have validity? is a better question than Are they rigorous? because validity has scientific meaning whereas rigor does not.

But I also honestly believe, from the heart, that alternative grading systems have much greater validity, no matter how you view it, than traditional systems. No, I do not have data yet to back this up2. But I have spent almost 30 years doing both kinds of grading in multiple contexts, and when I look back on it, I simply have much more trust in my specifications grading results than I do in my traditional grading results. The assessments in specs grading, being criterion-referenced, are much more likely to accurately measure the construct they are intended to measure, and the fact that they’re graded on the results of feedback-focused iteration makes them more believable as predictors of future results. Traditional grading, on the other hand, I never fully trusted, and that was a big reason why I ditched it when I did.

Ironically, if “rigor” really means “validity”, then this makes alternative-graded courses a lot more rigorous by definition than traditionally-graded courses. You have to work to overcome the lack of validity — the lack of rigor — inherent in traditional grading systems. And many of the efforts by a lot of self-important academics out there to “increase rigor” by simply making tests “harder”, imposing more restrictions on students, and so on actually make the course less rigorous because you lose trust in the validity of the assessments.

So let’s talk about validity instead of rigor from now on, and see where it leads us.


One last thing: While it’s not good form to hedge your bets when writing about something, I need to make clear that despite having some educational research under our belts, neither David nor I is actually a social science researcher; we are presenting this idea of validity as people who are still learning the concept. (Maybe that’s obvious?) Anybody out there who actually is a social science researcher is more than welcome to correct and clarify us; in fact, we encourage you to do so in the comments.