Why teaching evaluations might not be a good way to evaluate teaching

ANALYSIS: An arbitrator has ruled that Ryerson University will no longer be able to use student evaluations of teaching to make decisions about hiring and promotions — but the debate over SETs is far from over, writes Josh Dehaas
By Josh Dehaas - Published on Sep 12, 2018
Arbitrator William Kaplan has ordered that Ryerson’s collective agreement be changed to state that SETs will not be used “to measure teaching effectiveness for promotion or tenure.” (Tibor Kolley/Globe and Mail)



If you’ve been to university or college, you probably remember filling out bubble sheets or clicking through online surveys at the end of each semester and answering such questions as “On a scale of 1 to 5, with 1 being ‘strongly disagree’ and 5 being ‘strongly agree,’ how effective was the teacher?”

As boring as these student evaluations of teaching (SETs) may seem to students, they can make or break a professor. Some universities and colleges take the results, calculate average scores, and provide the numbers to the committees that decide who gets hired or promoted. An instructor with a 4.5/5 rating may have an edge over a peer with a 4.2/5.

Faculty unions in Canada have long argued against the practice. They say that SETs force instructors to go easy on students in the hopes of getting higher scores. They say students aren’t reliable judges of how much they’ve learned. They argue that SETs can be influenced by racial or gender biases. The numbers are flawed and shouldn’t factor into such important decisions as who gets a tenure-track job, the unions claim.

At least one well-regarded legal mind agrees. In order to settle a dispute between Ryerson University and its faculty association, arbitrator William Kaplan looked at testimony from experts on SETs and concluded that the evidence “establishes, with little ambiguity, that [this] key tool in assessing teaching effectiveness is flawed, while the use of averages is fundamentally and irreparably flawed.”

Kaplan has ordered that Ryerson’s collective agreement be changed to state that SETs will not be used “to measure teaching effectiveness for promotion or tenure.”

He did not say that Ryerson must stop collecting SET data, but he did indicate that it must be presented as a frequency distribution (how many people said they “strongly agree,” how many said they “agree,” etc.) rather than as an average and that those tasked with evaluating faculty must be educated “in inherent and systemic biases in SETs.”

Emma Phillips, one of the lawyers for the faculty association, says that although the ruling is binding only on Ryerson, the declaration that SETs may not be used in hiring or promotion at that institution could have wide-ranging implications for post-secondary instructors everywhere.

“What Kaplan found is that students are really not in a position to assess whether a professor is effective,” Phillips said.

Part of the testimony that Kaplan considered was a report by Philip Stark, a statistician at the University of California, Berkeley. Stark states that if an evaluation uses a 1 to signify “strongly disagree” and a 5 to signify “strongly agree,” attempting to come up with a score out of 5 is “statistically meaningless.” If a teacher gets a 2.5 rating, that could mean she was considered mediocre by the entire class, or it might mean she challenged her students and ended up loved by the hard workers and hated by the slackers. The average provides no way of telling who was a better teacher.
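Stark’s objection to averaging is easy to see with a small illustration: two classes can produce the same mean rating from completely different distributions. (The ratings below are invented for the example, not taken from Stark’s report.)

```python
from statistics import mean
from collections import Counter

# Invented ratings on a 1-to-5 scale (1 = "strongly disagree", 5 = "strongly agree").
# Class A: every student found the instructor middling.
class_a = [2, 3] * 5          # ratings cluster around the middle
# Class B: the class split between enthusiasts and detractors.
class_b = [1] * 5 + [4] * 5   # ratings pile up at the extremes

# The averages are identical...
print(mean(class_a), mean(class_b))          # 2.5 2.5

# ...but the frequency distributions tell very different stories.
print(sorted(Counter(class_a).items()))      # [(2, 5), (3, 5)]
print(sorted(Counter(class_b).items()))      # [(1, 5), (4, 5)]
```

Reporting the full frequency distribution, as Kaplan ordered, preserves exactly the information that the single average throws away.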

But regardless of how the results are presented, SETs are useful only if the questions actually measure how well a teacher is teaching. Stark isn’t convinced that they do.

First, he points to the possible biases. Studies have shown that students’ opinions of their teachers appear to be swayed by the grades they expect to receive, by whether the material is heavy in math, and by the instructor’s gender, age, attractiveness, and race — even by the physical condition of the classroom.

On top of that, Stark reviewed several recent studies and concluded that they “generally find weak or negative association between SET and instructor effectiveness.”

One of those studies is a 2016 meta-analysis (that is, a study of studies) by Bob Uttl, Carmela White, and Daniela Wong Gonzalez — all psychologists working at Canadian universities. Their analysis considered two main types of research. The first type involved students who had different instructors in different “sections” of the same course but wrote a common exam — the assumption being that if students can accurately judge how much they learned, those who rate their instructors higher should also earn higher exam marks. The second type tracked students’ grades over time to see whether those who gave higher scores to teachers in early courses got higher grades in later courses, which is what one would expect if they had truly had better teachers early on. The researchers found that some studies of the first type showed large or moderate correlations between SET scores and learning, while the second type showed “no or only minimal correlation.” In part because the latter studies had much larger sample sizes, they concluded that there are “no significant correlations between the SET ratings and learning.”

Betsy Barre, who directs the Teaching and Learning Collaborative at Wake Forest University, in North Carolina, has spent years researching whether SETs actually measure student learning. While she acknowledges the potential for bias and admits that there isn’t yet enough evidence to conclude definitively that SETs measure how well teachers teach, she nonetheless believes they likely have some value. Barre sees common-exam studies as the gold standard, so she puts more weight than Uttl, White, and Wong Gonzalez did on a 1981 meta-analysis by psychologist Peter Cohen that looked at 68 multi-section courses and found a moderately strong correlation between SET ratings and student achievement.

Barre says that teachers must be assessed somehow and that the other main method of judging teachers — having a peer observe a class — may be even less reliable.

“Peers can also be sexist and racist; peers can also be prioritizing their own boutique idea about what good teaching is,” she says. “And, unlike students, they’re often only coming into one class, so they don’t see what’s happening every day.”

That’s why Barre believes that SETs — along with peer evaluations and other measures — should still be used for hiring and promotion when they are written, administered, and interpreted according to current best practices, which include watching for biases and not assuming that a teacher rated 4.5 is better than one rated 4.3 — a difference that small may be meaningless.

In other words, don’t assume that the decision at Ryerson has settled the debate over SETs.
