Report cards for teachers: Are they fair?

A new study underwritten by the Bill & Melinda Gates Foundation (which is among the funders of The Hechinger Report) tackles the question of whether the new teacher evaluation systems going into effect in school districts across the country are accurate and reliable in identifying which teachers are good and which are not. The researchers found that the new evaluation systems are likely to be more reliable than the methods used in the past. But they are not perfect.

The study argues that current teacher evaluation systems are broken, suggesting, as many critics have in the past, that the problem with the old approach was its failure to distinguish among the great, the mediocre and the bad: More than 90 percent of teachers were labeled as satisfactory, even in school districts where student achievement and graduation rates were abysmal. Under the old system, principals usually conducted one classroom observation per teacher every few years, marking off things on a checklist, like whether students were behaving and goals were displayed on the chalkboard.

Advocates for a new system of measuring teachers say it is more consistent and precise. The system typically includes more than one classroom observation a year, standardized test scores that measure how much a teacher’s students improve academically, and other measures such as student survey results. It is being promoted as more dependable in telling districts which teachers are great, which need help to get better, and, most controversially, which need to be let go.

The new method of grading teachers is still nascent, but it’s spreading rapidly. The Gates study adds to a growing collection of data hinting at whether it will live up to its billing as a better way of measuring a teacher’s effectiveness. (We’ve also been amassing anecdotal evidence during an ongoing reporting project here at Hechinger; stay tuned this month for the latest installment from Memphis, Tenn.)

That is, can we know for sure that a teacher who receives a top grade on one of the more rigorous and frequent classroom observations is also going to have a classroom of students who get top grades on achievement tests at the end of the year and on other important measures, like interest and happiness in school? (In response to criticisms of previous studies, the researchers expanded their study to look at indicators of success beyond test scores.) And are the evaluation measures, whether they are qualitative observations or quantitative test scores, accurate in labeling teachers great, ordinary, or bad?

In short, the researchers found that scores on observations did indeed tend to correlate with results on a variety of achievement tests. The correlations weren’t perfect, but researchers and proponents of the new evaluation systems say that’s because the two are not measuring the same thing: “There may be some teaching competencies that affect students in ways we are not measuring,” the study’s authors wrote.

In math, the correlation between what teachers did in their classrooms and how students performed at the end of the year on state tests was twice as strong as it was in reading, which researchers said could be a result of the tests, not the teachers. (And which could raise questions about the appropriateness of certain tests as tools to rate teachers, something critics of the new methods are already concerned about.)

The question of reliability, or whether we can count on ratings by different observers to be similar, is also central in the Gates report—part of an ongoing series looking at how new evaluation systems are working in a large sample of schools around the country.

“Reliability is important because without it classroom observations will paint an inaccurate portrait of teachers’ practice,” the report’s authors say. And an inaccurate portrait could lead to firing above-average teachers (or keeping on underperformers).

The study’s authors trained hundreds of evaluators to score videos of teachers teaching, and then compared the scores to see whether they were consistent across raters. Here’s what they found:

“Even with systematic training and certification of observers, the MET project needed to combine scores from multiple raters and multiple lessons to achieve high levels of reliability. A teacher’s score varied considerably from lesson to lesson, as well as from observer to observer.”

In particular, a single observation by a single observer was more volatile and less reliable than multiple observations by different people.
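To see why pooling observations helps, here is a rough back-of-the-envelope sketch using the Spearman-Brown formula, a textbook rule of thumb for how reliability grows when several comparable measurements are averaged. This is not the method used in the Gates study, and the starting reliability of 0.4 for a single observation is a made-up number chosen purely for illustration.

```python
def spearman_brown(single_reliability: float, n_observations: int) -> float:
    """Projected reliability of the average of n comparable observations.

    Spearman-Brown formula: k * r / (1 + (k - 1) * r).
    It assumes each observation is equally noisy, which is a simplification.
    """
    if not 0.0 <= single_reliability <= 1.0:
        raise ValueError("single_reliability must be between 0 and 1")
    if n_observations < 1:
        raise ValueError("n_observations must be at least 1")
    r, k = single_reliability, n_observations
    return k * r / (1 + (k - 1) * r)


# Hypothetical value for illustration only; it does not come from the study.
HYPOTHETICAL_SINGLE_RELIABILITY = 0.4

for k in (1, 2, 4):
    projected = spearman_brown(HYPOTHETICAL_SINGLE_RELIABILITY, k)
    print(f"{k} observation(s): projected reliability {projected:.2f}")
```

Under these assumptions, the projected reliability rises with each added observation but with diminishing returns, which is consistent with the report’s point that scores from several lessons and several raters had to be combined.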

Statistical measures using student test-score growth tend to be more reliable than observations, but the study and other research suggest that they are still wrong about a quarter of the time.

The good news is that the test-score measures and the observation scores seem to be more accurate and reliable when combined.

What does this mean for the real world?

Many teachers will be rated by one person, their principal. Some will receive classroom visits by two people, a principal and an assistant principal. In a few places, like D.C., the district has brought in outside evaluators to increase reliability. The new evaluations tend to include several observations over the course of a year, especially for new teachers. But in Tennessee, which is at the forefront of the new evaluation push, the number of times evaluators must be in classrooms each year has been reduced in response to complaints from districts and principals.

In most places that are adopting new evaluation systems, test scores, observations, survey results and other measures are all being combined, which the Gates study suggests will make the systems more trustworthy. Yet the vast majority of teachers do not teach in subjects or grades that are tested, meaning they will not receive value-added scores on their students’ growth.

So are the new evaluations likely to be ironclad? The report’s findings suggest not. Are they better than what existed before? The authors say yes: “Combining new approaches to measuring effective teaching—while not perfect—significantly outperforms traditional measures.”

For more reporting on the study, see the coverage from Education Week and the Los Angeles Times. American Federation of Teachers president Randi Weingarten also commented on the results. She was pleased that the report validated the union’s position that multiple measures should be used to evaluate teachers, but disappointed “that after all of the Gates Foundation’s research, the focus is still on measuring performance, not about improving performance.”

Posted on January 9, 2012